library(kableExtra)
All three files are stored in a GitHub repository.
library(rvest)
library(dplyr)
htmlFile<-"https://raw.githubusercontent.com/pkowalchuk/Data607_Week7_Assignment/master/books.html"
htmlBooks<-tbl_df(as.data.frame(read_html(htmlFile) %>% html_table(header = NA, trim=TRUE, fill=TRUE)))
htmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Tittle | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | Brian Christian, Tom Griffiths | Science, Computer Science, Psychology, Non-fiction, Technology | 368 | Yes |
| String Theory | David Foster Wallace | Writing, Non-fiction, Essays, Science, Sports, Sports and Games | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
library(XML)
library(RCurl)
xmlfile<-getURL("https://raw.githubusercontent.com/pkowalchuk/Data607_Week7_Assignment/master/books.xml")
xmlBooks<-tbl_df(xmlToDataFrame(xmlfile))
xmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Title | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | Brian Christian,Tom Griffiths | Science,Computer Science,Psychology,Non-fiction,Technology, | 368 | Yes |
| String Theory | David Foster Wallace | Writing,Non-fiction,Essays,Science,Sports,Sports and Games | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
library(RJSONIO)  # provides isValidJSON()
library(jsonlite) # loaded second, so its fromJSON() masks RJSONIO's
jsonFile<-"https://raw.githubusercontent.com/pkowalchuk/Data607_Week7_Assignment/master/books.json"
isValidJSON(jsonFile)
## [1] TRUE
jsonbooks<-tbl_df(as.data.frame(fromJSON(jsonFile)))
jsonbooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| books.Title | books.Authors | books.Genres | books.Pages | books.AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c(“Brian Christian”, “Tom Griffiths”) | c(“Science”, “Computer Science”, “Psychology”, “Non-fiction”, “Technology”) | 368 | Yes |
| String Theory | David Foster Wallace | c(“Writing”, “Non-fiction”, “Essays”, “Science”, “Sports”, “Sports and Games”) | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
Although all three data frames contain the same information, they are not structured in exactly the same way. Each one holds the child data for Authors and Genres a little differently, but we can manipulate the data frames so that the results match.
In the HTML file, the authors and genres of each book are stored in a single table cell as items separated by commas. The data arrives in the data frame the same way: as one string per book, with the authors separated by commas. If necessary, we can split each string into a vector in which every element holds one author or one genre. This makes the data frame easier to work with when operations on individual authors are required.
library(stringr)
htmlBooks$Authors[1]<-str_split(htmlBooks$Authors[1],", ")
htmlBooks$Genres[1]<-str_split(htmlBooks$Genres[1],", ")
htmlBooks$Genres[2]<-str_split(htmlBooks$Genres[2],", ")
htmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Tittle | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c(“Brian Christian”, “Tom Griffiths”) | c(“Science”, “Computer Science”, “Psychology”, “Non-fiction”, “Technology”) | 368 | Yes |
| String Theory | David Foster Wallace | c(“Writing”, “Non-fiction”, “Essays”, “Science”, “Sports”, “Sports and Games”) | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
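The row-by-row splitting above can also be done for a whole column in one call. The following is a minimal sketch, not the code used in this report: it uses base R's `strsplit()` instead of `str_split()`, and the data frame `books` is a hypothetical stand-in built from the values in the table.

```r
# Toy copy of the Authors and Genres columns as they arrive from the HTML table
# (values copied from the table above; the name `books` is hypothetical).
books <- data.frame(
  Authors = c("Brian Christian, Tom Griffiths",
              "David Foster Wallace",
              "Leo Tolstoy"),
  Genres = c("Science, Computer Science, Psychology, Non-fiction, Technology",
             "Writing, Non-fiction, Essays, Science, Sports, Sports and Games",
             "Novel"),
  stringsAsFactors = FALSE
)

# strsplit() is vectorized over its input, so one call splits every row and
# returns a list holding one character vector per book.
books$Authors <- strsplit(books$Authors, ", ", fixed = TRUE)
books$Genres  <- strsplit(books$Genres,  ", ", fixed = TRUE)

# Count the items stored for each book.
author_counts <- lengths(books$Authors)
genre_counts  <- lengths(books$Genres)
```

Splitting the whole column avoids hard-coding row indices, so the same two lines keep working if books are added to the file.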
In the XML import, all the data is stored as factors. Although we might choose to work with the data in this format, we can unfactor it to match what we have from HTML.
library(varhandle)
xmlBooks$Title<-unfactor(xmlBooks$Title)
xmlBooks$Authors<-unfactor(xmlBooks$Authors)
xmlBooks$Genres<-unfactor(xmlBooks$Genres)
xmlBooks$Pages<-as.integer(unfactor(xmlBooks$Pages))
xmlBooks$AvailableInAudio<-unfactor(xmlBooks$AvailableInAudio)
#we split the authors to have them in a vector
xmlBooks$Authors<-c(str_split(xmlBooks$Authors[1],","),xmlBooks$Authors[2],xmlBooks$Authors[3])
#we do the same for genres
xmlBooks$Genres<-c(str_split(xmlBooks$Genres[1],","),str_split(xmlBooks$Genres[2],","),xmlBooks$Genres[3])
xmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Title | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c(“Brian Christian”, “Tom Griffiths”) | c(“Science”, “Computer Science”, “Psychology”, “Non-fiction”, “Technology”, “”) | 368 | Yes |
| String Theory | David Foster Wallace | c(“Writing”, “Non-fiction”, “Essays”, “Science”, “Sports”, “Sports and Games”) | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
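Two details of the XML conversion are worth a small illustration: why the factor is converted to text before `as.integer()`, and how to deal with the empty genre left by the trailing comma in the first record. This sketch uses base R only and is an assumed equivalent, not `varhandle`'s implementation; `pages` and `genres` are hypothetical stand-ins for the XML columns.

```r
# Hypothetical stand-ins for two XML-derived columns, stored as factors.
pages  <- factor(c("368", "138", "1225"))
genres <- factor(c("Science,Computer Science,Psychology,Non-fiction,Technology,",
                   "Novel"))

# as.integer() on a factor returns the internal level codes, not the labels.
level_codes <- as.integer(pages)

# Converting to character first recovers the real page counts.
page_counts <- as.integer(as.character(pages))

# Split on commas, then keep only non-empty items. Base strsplit() already
# omits the empty piece after a trailing separator; the nzchar() filter is a
# guard that also covers splitters that keep it, such as stringr::str_split().
genre_list <- strsplit(as.character(genres), ",", fixed = TRUE)
genre_list <- lapply(genre_list, function(g) g[nzchar(g)])
```

Calling `as.integer()` directly on the factor would silently produce level codes instead of page counts, which is why the intermediate character step matters.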
In JSON, the Authors and Genres columns of the data frame already hold vectors of the individual items, so no further manipulation is required. For holding data, this format seems the most efficient: it has less overhead than XML, and it is geared towards storing and transmitting data rather than presenting it, as HTML is. We can still rename the columns to match the other data frames and convert the Pages column to integers.
colnames(jsonbooks)<-c('Tittle','Authors','Genres','Pages','AvailableInAudio')
jsonbooks$Pages<-as.integer(jsonbooks$Pages)
jsonbooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Tittle | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c(“Brian Christian”, “Tom Griffiths”) | c(“Science”, “Computer Science”, “Psychology”, “Non-fiction”, “Technology”) | 368 | Yes |
| String Theory | David Foster Wallace | c(“Writing”, “Non-fiction”, “Essays”, “Science”, “Sports”, “Sports and Games”) | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
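As an alternative to typing out every column name, the `books.` prefix can be stripped with a pattern. This is only a sketch on a hypothetical two-column frame (`json_like`), and it assumes the page counts arrive as plain numbers. Note that it yields `Title`, which matches the XML data frame rather than the `Tittle` header of the HTML table.

```r
# Hypothetical data frame with the prefixed names produced by the JSON import.
json_like <- data.frame(
  books.Title = c("String Theory", "War and Peace"),
  books.Pages = c(138, 1225),  # assumed to arrive as doubles
  stringsAsFactors = FALSE
)

# Strip the leading "books." from every column name.
names(json_like) <- sub("^books\\.", "", names(json_like))

# Store the page counts as integers.
json_like$Pages <- as.integer(json_like$Pages)
```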
With these operations, all three data frames have similar structures:
htmlBooks
## # A tibble: 3 x 5
## Tittle Authors Genres Pages AvailableInAudio
## <chr> <list> <list> <int> <chr>
## 1 Algorithms to Live By: The Comp… <chr [2… <chr [… 368 Yes
## 2 String Theory <chr [1… <chr [… 138 No
## 3 War and Peace <chr [1… <chr [… 1225 Yes
xmlBooks
## # A tibble: 3 x 5
## Title Authors Genres Pages AvailableInAudio
## <chr> <list> <list> <int> <chr>
## 1 Algorithms to Live By: The Comp… <chr [2… <chr [… 368 Yes
## 2 String Theory <chr [1… <chr [… 138 No
## 3 War and Peace <chr [1… <chr [… 1225 Yes
jsonbooks
## # A tibble: 3 x 5
## Tittle Authors Genres Pages AvailableInAudio
## * <chr> <list> <list> <int> <chr>
## 1 Algorithms to Live By: The Comp… <chr [2… <chr [… 368 Yes
## 2 String Theory <chr [1… <chr [… 138 No
## 3 War and Peace <chr [1… <chr [… 1225 Yes
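Printing the frames shows that they look alike, but a programmatic check is stricter. The sketch below uses two small hypothetical frames (`a` and `b`) rather than the real ones; it shows that `identical()` fails while one column is named `Tittle` and the other `Title`, and succeeds once the names agree.

```r
# Two hypothetical frames holding the same books under different column names.
a <- data.frame(Tittle = c("String Theory", "War and Peace"),
                Pages  = c(138L, 1225L),
                stringsAsFactors = FALSE)
b <- data.frame(Title  = c("String Theory", "War and Peace"),
                Pages  = c(138L, 1225L),
                stringsAsFactors = FALSE)

# List columns, one character vector of authors per book.
a$Authors <- list("David Foster Wallace", "Leo Tolstoy")
b$Authors <- list("David Foster Wallace", "Leo Tolstoy")

# identical() compares names as well as values, so this is FALSE.
same_before <- identical(a, b)

# Align the one differing column name, then compare again.
names(a)[names(a) == "Tittle"] <- "Title"
same_after <- identical(a, b)
```

The same idea applies to the real data frames, although differences such as the empty genre in the XML data would also need to be cleaned up before the check passes.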