Homework 7

Pick 3 of my favorite books and create HTML, XML, and JSON files detailing relevant information about them. Import these files using R and create a data frame from each.

HTML

Getting the data was relatively easy using rvest.

library(rvest)

url <- "https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Week%207/books.html"

html_bk <- url %>% read_html() %>% html_node("table") %>% html_table(fill = TRUE)

head(html_bk)
##     Book Title           Author        Additional Author
## 1         Dune    Frank Herbert Brian Herbert - Foreword
## 2   The Hobbit   J.R.R. Tolkien                     None
## 3 Ender's Game Orson Scott Card                     None
##                  Publisher Year           ISBN
## 1                      Ace 1990 978-0441172719
## 2 Houghton Mifflin Harcout 2012 978-0547928227
## 3                 Tor Teen 2014 978-0765378484
##                                                                                                                             Recent Film Adaptation
## 1                                                                                                 Dune(12/18/2020, Directed by Denis Villeneuve!!)
## 2 The Hobbit, film series (An Unexpected Journey(2012), The Desolation of Smaug(2013), The Battle of Five Armies(2014), Directed by Peter Jackson)
## 3                                                                                                   Ender's Game(10/24/13, Directed by Gavin Hood)
##                                                                                                                                                                                                                                                                                                       Favorite Quote
## 1 “I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.”
## 2                                                                                                                                                                                                                                                  The world is not in your books and maps, it’s out there – Gandalf
## 3                                                                                                                                                                                                                       “Early to bed and early to rise," Mazer intoned, "makes a man stupid and blind in the eyes."

XML

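When I first attempted to parse the XML file, I got an error stating that the XML content does not seem to be XML. After looking up the error online, I found that the parser was treating the URL itself as the XML content, so I fetched the file with GET() before parsing it.

The resulting data frame is messy (each book ends up as a list of named values, with repeated author and film fields), but the information is still accessible.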

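library(httr)
library(XML)

url <- "https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Week%207/books.xml"

# Fetch the file first; passing the URL straight to xmlParse() caused the error above
r <- GET(url)

# Parse XML file
xmlbook <- xmlParse(r)

# The root node contains the information for all 3 books.
root <- xmlRoot(xmlbook)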
root[1]
## $book
## <book>
##   <title>Dune</title>
##   <author>Frank Herbert</author>
##   <author>Brian Herbert</author>
##   <publisher>Ace</publisher>
##   <year>1990</year>
##   <isbn>978-0441172719</isbn>
##   <film>
##     <mv_title>Dune</mv_title>
##     <director>Denis Villeneneuve</director>
##     <mv_date>2020</mv_date>
##   </film>
##   <fav_quote>“I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.”</fav_quote>
## </book> 
## 
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
#root[2]
#root[3]


#Extract XML data:

data_xml <- xmlSApply(root,function(x) xmlSApply(x, xmlValue))

#Turn into data frame
books.frame <- data.frame(t(data_xml), row.names = FALSE)

books.frame[1:2,3]
## $book
##                                                                                            title 
##                                                                                   "Ender's Game" 
##                                                                                           author 
##                                                                               "Orson Scott Card" 
##                                                                                        publisher 
##                                                                                       "Tor Teen" 
##                                                                                             year 
##                                                                                           "2014" 
##                                                                                             isbn 
##                                                                                 "978-0765378484" 
##                                                                                             film 
##                                                                     "Ender's GameGavin Hood2013" 
##                                                                                        fav_quote 
## "“Early to bed and early to rise,\" Mazer intoned, \"makes a man stupid and blind in the eyes.”" 
## 
## $<NA>
## NULL
books.frame[1:2,2]
## $book
##                                                               title 
##                                                        "The Hobbit" 
##                                                              author 
##                                                    "J.R.R. Tolkien" 
##                                                           publisher 
##                                          "Houghton Mifflin Harcout" 
##                                                                year 
##                                                              "2012" 
##                                                                isbn 
##                                                    "978-0547928227" 
##                                                                film 
##                            "An Unexpected JourneyPeter Jackson2012" 
##                                                                film 
##                          "The Desolation of SmaugPeter Jackson2013" 
##                                                                film 
##                        "The Battle of Five ArmiesPeter Jackson2014" 
##                                                           fav_quote 
## "The world is not in your books and maps, it’s out there – Gandalf" 
## 
## $<NA>
## NULL
books.frame[1:2,1]
## $book
##                                                                                                                                                                                                                                                                                                                title 
##                                                                                                                                                                                                                                                                                                               "Dune" 
##                                                                                                                                                                                                                                                                                                               author 
##                                                                                                                                                                                                                                                                                                      "Frank Herbert" 
##                                                                                                                                                                                                                                                                                                               author 
##                                                                                                                                                                                                                                                                                                      "Brian Herbert" 
##                                                                                                                                                                                                                                                                                                            publisher 
##                                                                                                                                                                                                                                                                                                                "Ace" 
##                                                                                                                                                                                                                                                                                                                 year 
##                                                                                                                                                                                                                                                                                                               "1990" 
##                                                                                                                                                                                                                                                                                                                 isbn 
##                                                                                                                                                                                                                                                                                                     "978-0441172719" 
##                                                                                                                                                                                                                                                                                                                 film 
##                                                                                                                                                                                                                                                                                         "DuneDenis Villeneneuve2020" 
##                                                                                                                                                                                                                                                                                                            fav_quote 
## "“I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.”" 
## 
## $<NA>
## NULL
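
The messiness above comes from repeated child elements (two authors for "Dune", three films for "The Hobbit"). Below is a minimal sketch of one possible clean-up, not the approach used in this homework: it collapses repeated elements into a single string so that each book becomes one row. The inline XML is a hypothetical, shortened sample that only mirrors the element names shown above; the `books` root element name and the helper `collapse_field()` are assumptions made for illustration.

```r
library(XML)

# Hypothetical sample that mirrors the element names in books.xml;
# the root element name "books" is an assumption.
xml_text <- '<books>
  <book>
    <title>Dune</title>
    <author>Frank Herbert</author>
    <author>Brian Herbert</author>
    <year>1990</year>
    <film><mv_title>Dune</mv_title></film>
  </book>
  <book>
    <title>The Hobbit</title>
    <author>J.R.R. Tolkien</author>
    <year>2012</year>
    <film><mv_title>An Unexpected Journey</mv_title></film>
    <film><mv_title>The Desolation of Smaug</mv_title></film>
  </book>
</books>'

# asText = TRUE tells xmlParse() the argument is XML content, not a file or URL
doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Hypothetical helper: join every match of a relative XPath into one string
collapse_field <- function(node, path) {
  paste(xpathSApply(node, path, xmlValue), collapse = "; ")
}

# Build one row per book, then stack the rows
tidy_books <- do.call(rbind, lapply(book_nodes, function(b) {
  data.frame(
    title   = collapse_field(b, "./title"),
    authors = collapse_field(b, "./author"),
    year    = collapse_field(b, "./year"),
    films   = collapse_field(b, "./film/mv_title"),
    stringsAsFactors = FALSE
  )
}))

tidy_books
```

Collapsing with a separator keeps one row per book; if each film needs its own row instead, the films would have to be extracted separately and joined back to the books.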

JSON

JSON was definitely more involved and required careful examination of the data in order to properly traverse through it and pull relevant information.

library(RCurl)
library(tidyjson)
library(dplyr)

bookurl <- getURL("https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Week%207/books.json")

# Retrieve book information
books_js <- bookurl %>% enter_object("fav_books") %>% gather_array("bk") %>%
  spread_values(title = jstring("title"), author = jstring("author"), publisher = jstring("publisher"),
                year = jstring("year"), isbn = jstring("isbn"), quote = jstring("fav_quote"))

# Retrieve movie details
films <- bookurl %>% enter_object("fav_books") %>% gather_array("bk") %>%
  enter_object("film") %>% gather_array() %>%
  spread_values(film.title = jstring("name"), film.director = jstring("director"), film.year = jstring("year"))

# Merge all the information into a final table/data frame
Finalbooks_table <- books_js %>% left_join(films, by = "bk") %>%
  select(bk, title, author, year, isbn, quote, film.title, film.director, film.year)

Finalbooks_table
## Warning in as.character.tbl_json(.): attr(.,'JSON') has been removed from this
## tbl_json object
## # A tbl_json: 5 x 9 tibble with a "JSON" attribute
##   `attr(., "JSON"~    bk title author year  isbn  quote film.title film.director
##   <chr>            <int> <chr> <chr>  <chr> <chr> <chr> <chr>      <chr>        
## 1 <NA>                 1 Dune  Frank~ 1990  978-~ I mu~ Dune       Denis Villen~
## 2 <NA>                 2 The ~ J.R.R~ 2012  978-~ The ~ An Unexpe~ Peter Jackson
## 3 <NA>                 2 The ~ J.R.R~ 2012  978-~ The ~ The Desol~ Peter Jackson
## 4 <NA>                 2 The ~ J.R.R~ 2012  978-~ The ~ The Battl~ Peter Jackson
## 5 <NA>                 3 Ende~ Orson~ 2014  978-~ Earl~ Ender's G~ Gavin Hood   
## # ... with 1 more variable: film.year <chr>
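
The final table has 5 rows for 3 books because each film in a book's nested "film" array becomes its own row. The sketch below shows the same traversal with jsonlite and base R rather than tidyjson (a swapped-in technique, not the one used above), which makes the nesting explicit. The inline JSON is a hypothetical, shortened sample that reuses only the key names seen in the code above.

```r
library(jsonlite)

# Hypothetical sample reusing the key names from books.json
json_text <- '{
  "fav_books": [
    {"title": "Dune",
     "film": [{"name": "Dune", "director": "Denis Villeneuve", "year": "2020"}]},
    {"title": "The Hobbit",
     "film": [
       {"name": "An Unexpected Journey", "director": "Peter Jackson", "year": "2012"},
       {"name": "The Desolation of Smaug", "director": "Peter Jackson", "year": "2013"}
     ]}
  ]
}'

# simplifyVector = FALSE keeps the nested lists exactly as they appear in the JSON
parsed <- fromJSON(json_text, simplifyVector = FALSE)
books <- parsed$fav_books

# Outer loop over books, inner loop over each book's films: one row per pair
film_rows <- do.call(rbind, lapply(seq_along(books), function(i) {
  do.call(rbind, lapply(books[[i]]$film, function(f) {
    data.frame(
      bk            = i,
      title         = books[[i]]$title,
      film.title    = f$name,
      film.director = f$director,
      film.year     = f$year,
      stringsAsFactors = FALSE
    )
  }))
}))

film_rows
```

Running `str(parsed, max.level = 3)` on the parsed list is a quick way to inspect the nesting before deciding which keys to pull out.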

Conclusion

The data frames are not identical, and the methods used to retrieve the data also differ. HTML is simple, but if I hadn’t included a second author column for “Dune”, the data would not have been retrieved as easily: the column values would have been off by one.

XML and JSON extract all the relevant information, but the results are a lot “messier” and require a more extensive clean-up job.