Data Scraping and Parsing: My Favorite Books
This assignment takes a small data set of books stored in three common formats: HTML, XML, and JSON.
Each file is parsed into an R data frame and displayed.
Each parser returns a different structure, so standardizing the results into a data frame takes some work.
| Format | Function | Package | Returns |
|---|---|---|---|
| HTML | readHTMLTable | XML | List of data frames |
| XML | xmlParse | XML | XMLInternalDocument |
| JSON | fromJSON | rjson | List of Lists |
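To see how the return types in the table differ, the three parsers can be run on tiny inline strings. This is only a minimal sketch: the one-book snippets below are made up for illustration and stand in for the downloaded files, and it assumes the XML and rjson packages are installed.

```r
library(XML)    # htmlParse, readHTMLTable, xmlParse
library(rjson)  # fromJSON

# Hypothetical one-book inputs, used only to inspect the return types.
html_text <- "<table><tr><th>Title</th><th>Year</th></tr><tr><td>The Martian</td><td>2011</td></tr></table>"
xml_text  <- "<books><book><title>The Martian</title><year>2011</year></book></books>"
json_text <- '{"books":[{"title":"The Martian","year":"2011"}]}'

# HTML: parse the string first, then extract every <table> it contains.
html_tables <- readHTMLTable(htmlParse(html_text, asText = TRUE))

# XML: the result is a pointer to a parsed document, not an R list.
xml_doc <- xmlParse(xml_text, asText = TRUE)

# JSON: the result is a nested R list that mirrors the JSON structure.
json_list <- fromJSON(json_text)

class(html_tables)        # a plain list ...
class(html_tables[[1]])   # ... whose elements are data frames
class(xml_doc)
class(json_list)
```

Only the HTML route yields a data frame directly; the XML and JSON results need a further conversion step.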
The HTML table format looks like this.
<table>
<tr>
<th>Title</th>
<th>Year</th>
<th>Type</th>
<th>Authors</th>
</tr>
<tr>
<td>A Tree Grows in Brooklyn</td>
<td>1943</td>
<td>Fiction</td>
<td>Betty Smith</td>
</tr>
<tr>
<td>R For Dummies</td>
<td>2015</td>
<td>Non Fiction</td>
<td>Andrie de Vries</td><td>Joris Meys</td>
</tr>
<tr>
<td>The Martian</td>
<td>2011</td>
<td>Fiction</td>
<td>Andy Weir</td>
</tr>
</table>
Download the HTML file, parse it into a data frame, and display it.
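library(XML)    # readHTMLTable, xmlParse, xmlToDataFrame
library(knitr)  # kable
# CURR_PATH is assumed to be defined earlier (for example in a setup chunk).
html_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.html"
destfile <- file.path(CURR_PATH, "books.html")
download.file(html_file, destfile)
# readHTMLTable returns a list with one data frame per <table> in the page.
tables <- readHTMLTable(destfile)
books_html_df <- as.data.frame(tables[[1]])
colnames(books_html_df) <- c("Title", "Year", "Type", "Author1", "Author2")
kable(books_html_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)

| Title | Year | Type | Author1 | Author2 |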
|---|---|---|---|---|
| A Tree Grows in Brooklyn | 1943 | Fiction | Betty Smith | NA |
| R For Dummies | 2015 | Non Fiction | Andrie de Vries | Joris Meys |
| The Martian | 2011 | Fiction | Andy Weir | NA |
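The HTML result has separate Author1 and Author2 columns, while the XML and JSON sources carry a single authors field. One possible way to bring the HTML data frame into the same shape is sketched below in base R. The data frame is retyped by hand from the table above so the snippet runs on its own, and the helper `combine_authors` and the `"; "` separator are illustrative choices, not part of the original assignment.

```r
# Data frame retyped from the table above (stands in for books_html_df).
books_html_df <- data.frame(
  Title   = c("A Tree Grows in Brooklyn", "R For Dummies", "The Martian"),
  Year    = c("1943", "2015", "2011"),
  Type    = c("Fiction", "Non Fiction", "Fiction"),
  Author1 = c("Betty Smith", "Andrie de Vries", "Andy Weir"),
  Author2 = c(NA, "Joris Meys", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the non-missing author names of one row.
combine_authors <- function(...) {
  authors <- c(...)
  paste(authors[!is.na(authors)], collapse = "; ")
}

# Apply the helper row by row across the two author columns.
books_html_df$Authors <- mapply(
  combine_authors,
  books_html_df$Author1,
  books_html_df$Author2,
  USE.NAMES = FALSE
)

# Store the year as an integer and keep the four standardized columns.
books_html_df$Year <- as.integer(books_html_df$Year)
books_tidy <- books_html_df[, c("Title", "Year", "Type", "Authors")]
books_tidy
```

Dropping the NA entries before pasting keeps single-author rows free of a trailing separator.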
The XML format looks like this.
<books>
<book>
<title>A Tree Grows in Brooklyn</title>
<authors>
<author>Betty Smith</author>
</authors>
<year>1943</year>
<type>Fiction</type>
</book>
<book>
<title>R For Dummies</title>
<authors>
<author>Joris Meys</author>
<author>Andrie de Vries</author>
</authors>
<year>2015</year>
<type>Non Fiction</type>
</book>
<book>
<title>The Martian</title>
<authors>
<author>Andy Weir</author>
</authors>
<year>2011</year>
<type>Fiction</type>
</book>
</books>
Download the XML file, parse it into a data frame, and display it.
xml_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.xml"
destfile <- file.path(CURR_PATH, "books.xml")
download.file(xml_file, destfile)
xml_data <- xmlParse(file = destfile)
# xmlToDataFrame flattens each <book> into one row; when a book has several
# <author> elements their text is concatenated with no separator.
xml_df <- xmlToDataFrame(xml_data)
# Uncomment to inspect the result:
# class(xml_df)
# str(xml_df)
# xml_df
kable(xml_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)

| title | authors | year | type |
|---|---|---|---|
| A Tree Grows in Brooklyn | Betty Smith | 1943 | Fiction |
| R For Dummies | Joris MeysAndrie de Vries | 2015 | Non Fiction |
| The Martian | Andy Weir | 2011 | Fiction |
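In the table above the two authors of R For Dummies run together as "Joris MeysAndrie de Vries". A possible alternative, sketched here under the assumption that the XML package is installed, is to walk the `<book>` nodes with XPath and join the author names explicitly. The XML is inlined so the snippet is self-contained, and the helper `field` and the `"; "` separator are illustrative, not something the original code uses.

```r
library(XML)

# Same content as the XML file shown above, inlined for a self-contained run.
xml_text <- "
<books>
  <book><title>A Tree Grows in Brooklyn</title>
    <authors><author>Betty Smith</author></authors>
    <year>1943</year><type>Fiction</type></book>
  <book><title>R For Dummies</title>
    <authors><author>Joris Meys</author><author>Andrie de Vries</author></authors>
    <year>2015</year><type>Non Fiction</type></book>
  <book><title>The Martian</title>
    <authors><author>Andy Weir</author></authors>
    <year>2011</year><type>Fiction</type></book>
</books>"

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Hypothetical helper: text of the nodes matching a path relative to one <book>.
field <- function(node, path) xpathSApply(node, path, xmlValue)

xml_df <- data.frame(
  title   = sapply(book_nodes, field, path = "./title"),
  authors = sapply(book_nodes, function(n) {
    paste(field(n, "./authors/author"), collapse = "; ")
  }),
  year    = as.integer(sapply(book_nodes, field, path = "./year")),
  type    = sapply(book_nodes, field, path = "./type"),
  stringsAsFactors = FALSE
)
xml_df
```

This takes more code than `xmlToDataFrame`, but it gives control over how repeated child elements are combined.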
The JSON format looks like this.
{"books":[
{ "title":"A Tree Grows in Brooklyn",
"authors" : [ "Betty Smith" ],
"year" : "1943",
"type" : "Fiction"
},
{ "title":"R For Dummies",
"authors" : [ "Joris Meys", "Andrie de Vries" ],
"year" : "2014",
"type" : "Non Fiction"
},
{ "title":"The Martian",
"authors" : [ "Andy Weir" ],
"year" : "2011",
"type" : "Fiction"
}
]}
Download the JSON file, parse it into a data frame, and display it.
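library(rjson)  # fromJSON
json_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.json"
destfile <- file.path(CURR_PATH, "books.json")
download.file(json_file, destfile)
# Read the file that was just downloaded, not a relative "books.json".
json_data <- fromJSON(file = destfile)
# Let's look at the structure (uncomment to inspect):
# class(json_data)
# str(json_data)
# cbind gives one column per book (V1..V3) and one row per field.
json_df <- as.data.frame(do.call(cbind, json_data$books))
# V2 (R For Dummies) holds a two-author vector; as.character() deparses it to c("...").
json_df$V2 <- as.character(json_df$V2)
# Transpose so that each book becomes a row.
json_transposed_df <- t(json_df)
kable(json_transposed_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)

| title | authors | year | type |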
|---|---|---|---|
| A Tree Grows in Brooklyn | Betty Smith | 1943 | Fiction |
| R For Dummies | c("Joris Meys", "Andrie de Vries") | 2014 | Non Fiction |
| The Martian | Andy Weir | 2011 | Fiction |
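The authors cell for R For Dummies shows up as a deparsed R vector, `c(...)`, because that book's authors entry has two elements. A possible alternative is to build one data frame row per book and collapse the authors into a single string first. This is a minimal sketch that assumes the rjson package is installed; the JSON is inlined exactly as shown above, and the `"; "` separator is an illustrative choice.

```r
library(rjson)

# Same content as the JSON file shown above, inlined for a self-contained run.
json_text <- '{"books":[
  {"title":"A Tree Grows in Brooklyn", "authors":["Betty Smith"],
   "year":"1943", "type":"Fiction"},
  {"title":"R For Dummies", "authors":["Joris Meys", "Andrie de Vries"],
   "year":"2014", "type":"Non Fiction"},
  {"title":"The Martian", "authors":["Andy Weir"],
   "year":"2011", "type":"Fiction"}
]}'

json_data <- fromJSON(json_text)

# Build a one-row data frame per book, collapsing authors to one string.
rows <- lapply(json_data$books, function(book) {
  data.frame(
    title   = book$title,
    authors = paste(unlist(book$authors), collapse = "; "),
    year    = as.integer(book$year),
    type    = book$type,
    stringsAsFactors = FALSE
  )
})

# Stack the rows into a single data frame.
json_df <- do.call(rbind, rows)
json_df
```

Building rows first and binding them avoids the transpose step and leaves `year` as an integer column rather than text.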