Data Scraping and Parsing: My Favorite Books


Overview

This assignment takes a small set of book data stored in three common formats (HTML, XML, and JSON), parses each one into an R data frame, and displays the result.

Each parser returns a different structure, so standardizing the results into a data frame takes some format-specific work.



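| Format | Function      | Package | Returns             |
|--------|---------------|---------|---------------------|
| HTML   | readHTMLTable | XML     | List of data frames |
| XML    | xmlParse      | XML     | XMLInternalDocument |
| JSON   | fromJSON      | rjson   | List of lists       |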
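
One way to handle the differing return types is to pick a single target layout and convert every parsed record into it. The sketch below is only an illustration: it assumes each book has already been extracted as a list with `title`, `authors` (a character vector), `year`, and `type`, and the helper names `book_to_row` and `books_to_df` are hypothetical rather than part of the original assignment.

```r
# Hypothetical helper: turn one parsed book record into a one-row data frame.
# Assumes at most `max_authors` authors per book (an assumption for this sketch).
book_to_row <- function(book, max_authors = 2) {
  # Pad the author vector with NA so every row ends up with the same columns
  authors <- c(book$authors, rep(NA_character_, max_authors))[seq_len(max_authors)]
  names(authors) <- paste0("Author", seq_len(max_authors))
  row <- c(
    list(Title = book$title, Year = as.integer(book$year), Type = book$type),
    as.list(authors)
  )
  as.data.frame(row, stringsAsFactors = FALSE)
}

# Hypothetical helper: stack the one-row data frames into a single data frame
books_to_df <- function(books, max_authors = 2) {
  do.call(rbind, lapply(books, book_to_row, max_authors = max_authors))
}

# Two records shaped like the sample data (values taken from the JSON sample)
books <- list(
  list(title = "A Tree Grows in Brooklyn", authors = "Betty Smith",
       year = "1943", type = "Fiction"),
  list(title = "R For Dummies", authors = c("Joris Meys", "Andrie de Vries"),
       year = "2014", type = "Non Fiction")
)

books_df <- books_to_df(books)
print(books_df)
```

Padding with `NA` keeps single-author books aligned with multi-author ones, which matches the `Author1`/`Author2` layout used for the HTML table.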




HTML





The HTML table format looks like this.


<table> 
 <tr>
    <th>Title</th>
    <th>Year</th>
    <th>Type</th>
    <th>Authors</th>
  </tr>
  <tr>
    <td>A Tree Grows in Brooklyn</td>
    <td>1943</td>
    <td>Fiction</td>
    <td>Betty Smith</td>
  </tr>
  
  <tr>
    <td>R For Dummies</td>
    <td>2015</td>
    <td>Non Fiction</td>
    <td>Andrie de Vries</td><td>Joris Meys</td>
  </tr>
  
  <tr>
    <td>The Martian</td>
    <td>2011</td>
    <td>Fiction</td>
    <td>Andy Weir</td>
  </tr>
</table>

 





Download the HTML file, parse it into a data frame, and display it.

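library(XML)    # readHTMLTable
library(knitr)  # kable

# CURR_PATH is assumed to be defined earlier as the download directory
html_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.html"

destfile <- paste0(CURR_PATH, "/books.html")
download.file(html_file, destfile)

# readHTMLTable returns a list with one data frame per <table> in the page
tables <- readHTMLTable(destfile)

# Use [[1]] to extract the first data frame itself rather than a one-element list
books_html_df <- as.data.frame(tables[[1]])

# The "R For Dummies" row has an extra <td>, so there are two author columns
colnames(books_html_df) <- c("Title", "Year", "Type", "Author1", "Author2")

kable(books_html_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)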
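My Favorite Books

| Title                    | Year | Type        | Author1         | Author2    |
|--------------------------|------|-------------|-----------------|------------|
| A Tree Grows in Brooklyn | 1943 | Fiction     | Betty Smith     | NA         |
| R For Dummies            | 2015 | Non Fiction | Andrie de Vries | Joris Meys |
| The Martian              | 2011 | Fiction     | Andy Weir       | NA         |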




XML





The XML format looks like this.

<books>
  <book>
    <title>A Tree Grows in Brooklyn</title>
    <authors>
      <author>Betty Smith</author>
    </authors>
    <year>1943</year>
    <type>Fiction</type>
  </book>
  <book>
    <title>R For Dummies</title>
    <authors>
      <author>Joris Meys</author>
      <author>Andrie de Vries</author>
    </authors>
    <year>2015</year>
    <type>Non Fiction</type>
  </book>
  <book>
    <title>The Martian</title>
    <authors>
      <author>Andy Weir</author>
    </authors>
    <year>2011</year>
    <type>Fiction</type>
  </book>
</books>
 





Download the XML file, parse it into a data frame, and display it.

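library(XML)    # xmlParse, xmlToDataFrame
library(knitr)  # kable

# CURR_PATH is assumed to be defined earlier as the download directory
xml_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.xml"

destfile <- paste0(CURR_PATH, "/books.xml")
download.file(xml_file, destfile)

# Parse the file into an XMLInternalDocument
xml_data <- xmlParse(file = destfile)

# xmlToDataFrame uses the text of each child of <book> as one cell, so the
# <author> values inside <authors> are concatenated without a separator
xml_df <- xmlToDataFrame(xml_data)

# Uncomment to inspect the result
# class(xml_df)
# str(xml_df)
# xml_df

kable(xml_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)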
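My Favorite Books

| title                    | authors                   | year | type        |
|--------------------------|---------------------------|------|-------------|
| A Tree Grows in Brooklyn | Betty Smith               | 1943 | Fiction     |
| R For Dummies            | Joris MeysAndrie de Vries | 2015 | Non Fiction |
| The Martian              | Andy Weir                 | 2011 | Fiction     |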
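
In the table above, the two authors of "R For Dummies" run together because `xmlToDataFrame` takes the text of the whole `<authors>` node. One possible workaround, sketched below, is to walk each `<book>` node with XPath and join the author names with a separator. The XML is inlined so the sketch runs without a download; the helper name `book_node_to_row` and the `"; "` separator are arbitrary choices for illustration, not part of the original assignment.

```r
library(XML)

# Inline copy of part of the sample so the sketch runs without a download
xml_text <- '<books>
  <book>
    <title>A Tree Grows in Brooklyn</title>
    <authors><author>Betty Smith</author></authors>
    <year>1943</year>
    <type>Fiction</type>
  </book>
  <book>
    <title>R For Dummies</title>
    <authors><author>Joris Meys</author><author>Andrie de Vries</author></authors>
    <year>2015</year>
    <type>Non Fiction</type>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: convert one <book> node into a one-row data frame
book_node_to_row <- function(node) {
  # Collect every <author> under this book, then join with a visible separator
  authors <- xpathSApply(node, "./authors/author", xmlValue)
  data.frame(
    title   = xpathSApply(node, "./title", xmlValue),
    authors = paste(authors, collapse = "; "),
    year    = as.integer(xpathSApply(node, "./year", xmlValue)),
    type    = xpathSApply(node, "./type", xmlValue),
    stringsAsFactors = FALSE
  )
}

xml_books_df <- do.call(rbind, lapply(getNodeSet(doc, "//book"), book_node_to_row))
print(xml_books_df)
```

Keeping the authors in one delimited column preserves the four-column shape that `xmlToDataFrame` produced; splitting them into `Author1`/`Author2` columns would instead match the HTML result.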




JSON





The JSON format looks like this.

 
{"books":[
  { "title" : "A Tree Grows in Brooklyn",
    "authors" : [ "Betty Smith" ],
    "year" : "1943",
    "type" : "Fiction"
  },
  { "title" : "R For Dummies",
    "authors" : [ "Joris Meys", "Andrie de Vries" ],
    "year" : "2014",
    "type" : "Non Fiction"
  },
  { "title" : "The Martian",
    "authors" : [ "Andy Weir" ],
    "year" : "2011",
    "type" : "Fiction"
  }
]}
 





Download the JSON file, parse it into a data frame, and display it.

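library(rjson)  # fromJSON
library(knitr)  # kable

# CURR_PATH is assumed to be defined earlier as the download directory
json_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.json"

destfile <- paste0(CURR_PATH, "/books.json")
download.file(json_file, destfile)

# Read the file that was just downloaded, not a relative "books.json"
json_data <- fromJSON(file = destfile)

# Let's look at the structure
# class(json_data)
# str(json_data)

# Bind the books as columns: one column per book, one row per field
json_df <- as.data.frame(do.call(cbind, json_data$books))
# V2 (R For Dummies) holds a two-element authors vector; flatten it to text
json_df$V2 <- as.character(json_df$V2)
# Transpose so that each book becomes a row
json_transposed_df <- t(json_df)

kable(json_transposed_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)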
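My Favorite Books

| title                    | authors                            | year | type        |
|--------------------------|------------------------------------|------|-------------|
| A Tree Grows in Brooklyn | Betty Smith                        | 1943 | Fiction     |
| R For Dummies            | c("Joris Meys", "Andrie de Vries") | 2014 | Non Fiction |
| The Martian              | Andy Weir                          | 2011 | Fiction     |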
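
The `c("Joris Meys", "Andrie de Vries")` cell appears because the two-element authors vector is converted to text with `as.character`. A possible alternative, sketched below, collapses each book's authors into a single string before the rows are combined. The JSON is inlined so the sketch runs without a download, the helper name `json_book_to_row` is hypothetical, and the sketch assumes `rjson::fromJSON` returns each `authors` array as a character vector or list of strings.

```r
library(rjson)

# Inline copy of part of the sample so the sketch runs without a download
json_text <- '{"books":[
  {"title":"A Tree Grows in Brooklyn", "authors":["Betty Smith"],
   "year":"1943", "type":"Fiction"},
  {"title":"R For Dummies", "authors":["Joris Meys","Andrie de Vries"],
   "year":"2014", "type":"Non Fiction"}
]}'

parsed <- fromJSON(json_str = json_text)

# Hypothetical helper: convert one parsed book (a named list) into a one-row data frame
json_book_to_row <- function(book) {
  data.frame(
    title   = book$title,
    # unlist() covers both a character vector and a list of strings
    authors = paste(unlist(book$authors), collapse = "; "),
    year    = as.integer(book$year),
    type    = book$type,
    stringsAsFactors = FALSE
  )
}

json_books_df <- do.call(rbind, lapply(parsed$books, json_book_to_row))
print(json_books_df)
```

Building one row per book and stacking with `rbind` avoids the transpose step and yields ordinary character and integer columns instead of a matrix.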