Pick three of my favorite books and create HTML, XML, and JSON files detailing relevant information about them. Import each file into R and create a data frame.
Getting the HTML data was relatively easy using rvest.
library(rvest)
url <- "https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Week%207/books.html"
# Read the page, take the first <table> node, and convert it to a data frame
html_bk <- url %>% read_html() %>% html_node("table") %>% html_table(fill = TRUE)
head(html_bk)
## Book Title Author Additional Author
## 1 Dune Frank Herbert Brian Herbert - Foreword
## 2 The Hobbit J.R.R. Tolkien None
## 3 Ender's Game Orson Scott Card None
## Publisher Year ISBN
## 1 Ace 1990 978-0441172719
## 2 Houghton Mifflin Harcout 2012 978-0547928227
## 3 Tor Teen 2014 978-0765378484
## Recent Film Adaptation
## 1 Dune(12/18/2020, Directed by Denis Villeneuve!!)
## 2 The Hobbit, film series (An Unexpected Journey(2012), The Desolation of Smaug(2013), The Battle of Five Armies(2014), Directed by Peter Jackson)
## 3 Ender's Game(10/24/13, Directed by Gavin Hood)
## Favorite Quote
## 1 “I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.”
## 2 The world is not in your books and maps, it’s out there – Gandalf
## 3 “Early to bed and early to rise," Mazer intoned, "makes a man stupid and blind in the eyes."
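When I first attempted to pull the XML file, I got an error stating that the XML content does not seem to be XML. After looking the error up online, I found that the parser was treating the URL string itself as the XML content. Fetching the file with GET() first and then parsing the response avoids this.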
The data frame is messy but the information is still accessible.
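library(httr)
library(XML)
url <- "https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Week%207/books.xml"
# Fetch the file first; passing the URL straight to xmlParse() produced the error described above
r <- GET(url)
# Parse the XML text from the response body
xmlbook <- xmlParse(content(r, as = "text", encoding = "UTF-8"), asText = TRUE)
# The root node contains all three books
root <- xmlRoot(xmlbook)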
root[1]
## $book
## <book>
## <title>Dune</title>
## <author>Frank Herbert</author>
## <author>Brian Herbert</author>
## <publisher>Ace</publisher>
## <year>1990</year>
## <isbn>978-0441172719</isbn>
## <film>
## <mv_title>Dune</mv_title>
## <director>Denis Villeneneuve</director>
## <mv_date>2020</mv_date>
## </film>
## <fav_quote>“I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.”</fav_quote>
## </book>
##
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
#root[2]
#root[3]
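# Extract the text of every child element of each book.
# The books have different numbers of child elements, so this returns a list rather than a matrix.
data_xml <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
# Turn into a data frame: a single row with one list column per book
books.frame <- data.frame(t(data_xml), row.names = FALSE)
# Column 3 holds the third book; there is no second row, hence the NULL entry in the output
books.frame[1:2,3]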
## $book
## title
## "Ender's Game"
## author
## "Orson Scott Card"
## publisher
## "Tor Teen"
## year
## "2014"
## isbn
## "978-0765378484"
## film
## "Ender's GameGavin Hood2013"
## fav_quote
## "“Early to bed and early to rise,\" Mazer intoned, \"makes a man stupid and blind in the eyes.”"
##
## $<NA>
## NULL
books.frame[1:2,2]
## $book
## title
## "The Hobbit"
## author
## "J.R.R. Tolkien"
## publisher
## "Houghton Mifflin Harcout"
## year
## "2012"
## isbn
## "978-0547928227"
## film
## "An Unexpected JourneyPeter Jackson2012"
## film
## "The Desolation of SmaugPeter Jackson2013"
## film
## "The Battle of Five ArmiesPeter Jackson2014"
## fav_quote
## "The world is not in your books and maps, it’s out there – Gandalf"
##
## $<NA>
## NULL
books.frame[1:2,1]
## $book
## title
## "Dune"
## author
## "Frank Herbert"
## author
## "Brian Herbert"
## publisher
## "Ace"
## year
## "1990"
## isbn
## "978-0441172719"
## film
## "DuneDenis Villeneneuve2020"
## fav_quote
## "“I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.”"
##
## $<NA>
## NULL
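The list columns above are what make this frame hard to work with: repeated elements such as author and film cannot line up in a rectangular table. One possible way around that, sketched below, is to collapse repeated fields into a single string per book. The inline XML is a trimmed stand-in for the real file. The root element name `books` and the helper `collapse_field()` are assumptions made for illustration and are not taken from the original data.

```r
library(XML)

# Trimmed sample that mirrors the <book> layout printed above.
# The <books> root name is an assumption; the real file may use another name.
xml_text <- '<books>
  <book>
    <title>Dune</title>
    <author>Frank Herbert</author>
    <author>Brian Herbert</author>
    <year>1990</year>
    <film><mv_title>Dune</mv_title><director>Denis Villeneuve</director><mv_date>2020</mv_date></film>
  </book>
  <book>
    <title>The Hobbit</title>
    <author>J.R.R. Tolkien</author>
    <year>2012</year>
    <film><mv_title>An Unexpected Journey</mv_title><director>Peter Jackson</director><mv_date>2012</mv_date></film>
    <film><mv_title>The Desolation of Smaug</mv_title><director>Peter Jackson</director><mv_date>2013</mv_date></film>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "/books/book")

# Hypothetical helper: join every match of a relative XPath into one string
collapse_field <- function(node, path) {
  paste(xpathSApply(node, path, xmlValue), collapse = "; ")
}

# Build one row per book, then stack the rows into a single data frame
tidy_books <- do.call(rbind, lapply(book_nodes, function(b) {
  data.frame(
    title   = collapse_field(b, "./title"),
    authors = collapse_field(b, "./author"),
    year    = collapse_field(b, "./year"),
    films   = collapse_field(b, "./film/mv_title"),
    stringsAsFactors = FALSE
  )
}))

print(tidy_books)
```

Collapsing keeps one row per book at the cost of packing several values into one cell. A long format with one row per film, as in the JSON section, is the alternative when the films need to be analyzed individually.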
JSON was definitely more involved: I had to examine the structure of the data carefully in order to traverse it and pull out the relevant information.
library(RCurl)
library(tidyjson)
library(dplyr)
bookurl <- getURL("https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Week%207/books.json")
# Retrieve book information: one row per book, indexed by bk
books_js <- bookurl %>%
  enter_object("fav_books") %>%
  gather_array("bk") %>%
  spread_values(title = jstring("title"), author = jstring("author"), publisher = jstring("publisher"),
                year = jstring("year"), isbn = jstring("isbn"), quote = jstring("fav_quote"))
# Retrieve movie details: one row per film, keeping the index bk of the parent book
films <- bookurl %>%
  enter_object("fav_books") %>%
  gather_array("bk") %>%
  enter_object("film") %>%
  gather_array() %>%
  spread_values(film.title = jstring("name"), film.director = jstring("director"), film.year = jstring("year"))
# Merge all the information into a final table/data frame; books with several films repeat once per film
Finalbooks_table <- books_js %>%
  left_join(films, by = "bk") %>%
  select(bk, title, author, year, isbn, quote, film.title, film.director, film.year)
Finalbooks_table
## Warning in as.character.tbl_json(.): attr(.,'JSON') has been removed from this
## tbl_json object
## # A tbl_json: 5 x 9 tibble with a "JSON" attribute
## `attr(., "JSON"~ bk title author year isbn quote film.title film.director
## <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> 1 Dune Frank~ 1990 978-~ I mu~ Dune Denis Villen~
## 2 <NA> 2 The ~ J.R.R~ 2012 978-~ The ~ An Unexpe~ Peter Jackson
## 3 <NA> 2 The ~ J.R.R~ 2012 978-~ The ~ The Desol~ Peter Jackson
## 4 <NA> 2 The ~ J.R.R~ 2012 978-~ The ~ The Battl~ Peter Jackson
## 5 <NA> 3 Ende~ Orson~ 2014 978-~ Earl~ Ender's G~ Gavin Hood
## # ... with 1 more variable: film.year <chr>
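The table has five rows for three books because each book is repeated once per film. The same one-row-per-film shape can be reproduced without tidyjson, which may help when checking the traversal by hand. The sketch below uses jsonlite instead, which is a swapped-in library and not the method used above. The inline JSON is a trimmed sample whose keys (`fav_books`, `film`, `name`, `director`, `year`) are inferred from the tidyjson calls, so the real file may differ.

```r
library(jsonlite)

# Trimmed sample; the key names are inferred from the tidyjson calls above
json_text <- '{"fav_books": [
  {"title": "Dune", "author": "Frank Herbert",
   "film": [{"name": "Dune", "director": "Denis Villeneuve", "year": "2020"}]},
  {"title": "The Hobbit", "author": "J.R.R. Tolkien",
   "film": [{"name": "An Unexpected Journey", "director": "Peter Jackson", "year": "2012"},
            {"name": "The Desolation of Smaug", "director": "Peter Jackson", "year": "2013"}]}
]}'

# Keep the nested list structure so the traversal is explicit
parsed <- fromJSON(json_text, simplifyVector = FALSE)

# For each book, emit one row per film and carry the book index along as bk
rows <- lapply(seq_along(parsed$fav_books), function(i) {
  bk <- parsed$fav_books[[i]]
  do.call(rbind, lapply(bk$film, function(f) {
    data.frame(
      bk = i,
      title = bk$title,
      author = bk$author,
      film.title = f$name,
      film.director = f$director,
      film.year = f$year,
      stringsAsFactors = FALSE
    )
  }))
})

films_long <- do.call(rbind, rows)
print(films_long)
```

Here the book index `bk` plays the same role as the `bk` column from `gather_array("bk")`: it ties each film row back to its parent book.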
The data frames are not identical, and the methods for retrieving the data differ as well. HTML is simple, but if I hadn’t included a second author column for “Dune”, the data would not have been retrieved as easily: the column values would have been off by one.
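The off-by-one problem can be illustrated with a small ragged table in which one row omits a cell instead of holding a placeholder such as "None". This is a minimal sketch with an invented inline table, and it assumes rvest 1.0 or later, where `html_table()` pads short rows by default.

```r
library(rvest)

# Invented sample: the second data row leaves out the "Additional Author" cell
html_text <- '<table>
  <tr><th>Book Title</th><th>Author</th><th>Additional Author</th><th>Publisher</th></tr>
  <tr><td>Dune</td><td>Frank Herbert</td><td>Brian Herbert - Foreword</td><td>Ace</td></tr>
  <tr><td>The Hobbit</td><td>J.R.R. Tolkien</td><td>Houghton Mifflin Harcourt</td></tr>
</table>'

ragged <- read_html(html_text) %>% html_node("table") %>% html_table()

# The publisher should slide left into the "Additional Author" column,
# and the last column of the short row should be padded with NA
print(ragged)
```

Writing an explicit placeholder cell in every row, as the original books.html does with "None", keeps each value under its own header.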
XML and JSON extract all the relevant information but are a lot “messier” and require a more extensive clean-up job.