Data Scraping and Parsing: My Favorite Books

Overview

This assignment takes a small set of book data stored in three common formats: HTML, XML, and JSON.

Each file is then parsed into an R data frame and displayed.

Each parser returns a different structure, so standardizing the results into a data frame requires some work.

Format	Function	Package	Returns
HTML	readHTMLTable	XML	List of data frames
XML	xmlParse	XML	XMLInternalDocument
JSON	fromJSON	rjson	List of Lists
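
The code in the sections below calls functions from three packages and uses a `CURR_PATH` variable, neither of which is shown in this write-up. A minimal setup sketch follows; setting `CURR_PATH` to the working directory is an assumption, not something the original states.

```r
library(XML)    # readHTMLTable(), xmlParse(), xmlToDataFrame()
library(rjson)  # fromJSON()
library(knitr)  # kable()

# Assumption: downloaded files are saved in the current working directory.
CURR_PATH <- getwd()
```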

HTML

The HTML table format looks like this.


<table> 
 <tr>
    <th>Title</th>
    <th>Year</th>
    <th>Type</th>
    <th>Authors</th>
  </tr>
  <tr>
    <td>A Tree Grows in Brooklyn</td>
    <td>1943</td>
    <td>Fiction</td>
    <td>Betty Smith</td>
  </tr>
  
  <tr>
    <td>R For Dummies</td>
    <td>2015</td>
    <td>Non Fiction</td>
    <td>Andrie de Vries</td><td>Joris Meys</td>
  </tr>
  
  <tr>
    <td>The Martian</td>
    <td>2011</td>
    <td>Fiction</td>
    <td>Andy Weir</td>
  </tr>
</table>

Download the HTML file, parse it into a data frame, and display it.

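html_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.html"

# Save a local copy, then read every <table> in the page (XML package).
destfile <- paste0(CURR_PATH, "/books.html")
download.file(html_file, destfile)
tables <- readHTMLTable(destfile)

# The page holds a single table; [[1]] extracts it as a data frame.
books_html_df <- as.data.frame(tables[[1]])

# One row has two author cells, so the parsed table is five columns wide.
colnames(books_html_df) <- c("Title", "Year", "Type", "Author1", "Author2")

kable(books_html_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)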

My Favorite Books
Title	Year	Type	Author1	Author2
A Tree Grows in Brooklyn	1943	Fiction	Betty Smith	NA
R For Dummies	2015	Non Fiction	Andrie de Vries	Joris Meys
The Martian	2011	Fiction	Andy Weir	NA
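
The HTML result spreads authors across `Author1` and `Author2`, while the XML and JSON files keep a single authors field. The following is a minimal sketch of one way to bring the HTML data frame to that shape. It uses a hand-built copy of the table above so that it runs on its own; the `Authors` column name and the comma separator are assumptions, not part of the original assignment.

```r
# Hand-built copy of the parsed HTML table shown above.
books_html_df <- data.frame(
  Title   = c("A Tree Grows in Brooklyn", "R For Dummies", "The Martian"),
  Year    = c("1943", "2015", "2011"),
  Type    = c("Fiction", "Non Fiction", "Fiction"),
  Author1 = c("Betty Smith", "Andrie de Vries", "Andy Weir"),
  Author2 = c(NA, "Joris Meys", NA),
  stringsAsFactors = FALSE
)

# Convert via character first so the step is also safe if Year is a factor.
books_html_df$Year <- as.integer(as.character(books_html_df$Year))

# Join the non-missing author cells of each row into one string.
author_cols <- c("Author1", "Author2")
books_html_df$Authors <- apply(
  books_html_df[author_cols], 1,
  function(a) paste(a[!is.na(a)], collapse = ", ")
)

# Drop the per-author columns now that Authors holds the same information.
books_html_df <- books_html_df[, c("Title", "Year", "Type", "Authors")]
```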

XML

The XML format looks like this.

<books>
  <book>
    <title>A Tree Grows in Brooklyn</title> 
    <authors>
      <author>Betty Smith</author>
    </authors>
  <year>1943</year>
   <type>Fiction</type>
 </book>
  <book>
    <title>R For Dummies</title> 
    <authors>
      <author>Joris Meys</author>
      <author>Andrie de Vries</author>
    </authors>
  <year>2015</year>
   <type>Non Fiction</type>
 </book>
  <book>
    <title>The Martian</title> 
    <authors>
      <author>Andy Weir</author>
    </authors>
  <year>2011</year>
   <type>Fiction</type>
 </book>
 </books>

Download the XML file, parse it into a data frame, and display it.

xml_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.xml"

destfile <- paste0(CURR_PATH, "/books.xml")
download.file(xml_file, destfile)

# Parse into an XMLInternalDocument, then flatten each <book> into one row.
xml_data <- xmlParse(file = destfile)
xml_df <- xmlToDataFrame(xml_data)

# Uncomment to inspect the result:
# class(xml_df)
# str(xml_df)
# xml_df

kable(xml_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)

My Favorite Books
title	authors	year	type
A Tree Grows in Brooklyn	Betty Smith	1943	Fiction
R For Dummies	Joris MeysAndrie de Vries	2015	Non Fiction
The Martian	Andy Weir	2011	Fiction
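
In the table above the two authors of R For Dummies run together as "Joris MeysAndrie de Vries", because `xmlToDataFrame` takes the combined text of each `<authors>` element. One possible workaround, sketched here under the assumption that a comma-separated string is an acceptable representation, is to walk the `<book>` nodes with XPath and join the `<author>` values explicitly. The sample document is inlined so the sketch runs without a download, and `field` is a hypothetical helper name.

```r
library(XML)

# Inline copy of the sample document shown above.
xml_text <- '<books>
  <book>
    <title>A Tree Grows in Brooklyn</title>
    <authors><author>Betty Smith</author></authors>
    <year>1943</year>
    <type>Fiction</type>
  </book>
  <book>
    <title>R For Dummies</title>
    <authors><author>Joris Meys</author><author>Andrie de Vries</author></authors>
    <year>2015</year>
    <type>Non Fiction</type>
  </book>
  <book>
    <title>The Martian</title>
    <authors><author>Andy Weir</author></authors>
    <year>2011</year>
    <type>Fiction</type>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Hypothetical helper: text of the named child element of one <book> node.
field <- function(node, name) xpathSApply(node, paste0("./", name), xmlValue)

books_xml_df <- data.frame(
  title = sapply(book_nodes, field, name = "title"),
  # Join every <author> under the book with ", " instead of concatenating.
  authors = sapply(book_nodes, function(b) {
    paste(xpathSApply(b, "./authors/author", xmlValue), collapse = ", ")
  }),
  year = as.integer(sapply(book_nodes, field, name = "year")),
  type = sapply(book_nodes, field, name = "type"),
  stringsAsFactors = FALSE
)
```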

JSON

The JSON format looks like this.

 
{"books":[
  { "title":"A Tree Grows in Brooklyn",
   "authors" : [ "Betty Smith" ],
   "year" : "1943",
   "type" : "Fiction"
    },
    { "title":"R For Dummies",
   "authors" : [  "Joris Meys", "Andrie de Vries" ],
   "year" : "2014",
   "type" : "Non Fiction"
    },
    { "title":"The Martian",
   "authors" : [ "Andy Weir" ],
   "year" : "2011",
   "type" : "Fiction"
    }
 
]}

Download the JSON file, parse it into a data frame, and display it.

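json_file <- "https://raw.githubusercontent.com/TheReallyBigApple/CunyAssignments/main/DATA607/MyFavoriteBooks.json"

destfile <- paste0(CURR_PATH, "/books.json")
download.file(json_file, destfile)
# Read from destfile so the path matches the download location.
json_data <- fromJSON(file = destfile)

# Let's look at the structure:
# class(json_data)
# str(json_data)

# cbind() puts one book per column (V1..V3), with the fields as rows.
json_df <- as.data.frame(do.call(cbind, json_data$books))
# V2 (R For Dummies) contains a two-author vector; as.character() turns it
# into the literal text c("...", "...") that appears in the table.
json_df$V2 <- as.character(json_df$V2)
# Transpose so that each book is a row and each field is a column.
json_transposed_df <- t(json_df)

kable(json_transposed_df, caption = "My Favorite Books", row.names = FALSE, booktabs = TRUE)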

My Favorite Books
title	authors	year	type
A Tree Grows in Brooklyn	Betty Smith	1943	Fiction
R For Dummies	c(“Joris Meys”, “Andrie de Vries”)	2014	Non Fiction
The Martian	Andy Weir	2011	Fiction
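
The `c(...)` text in the authors cell comes from coercing a two-element character vector to a single string. A minimal sketch of an alternative follows. It builds one data frame row per book and joins the authors with a comma, which is an assumed representation rather than the original approach. The JSON is inlined, with the same values as the sample above, so the sketch runs without a download.

```r
library(rjson)

# Inline copy of the sample document shown above.
json_text <- '{"books":[
  {"title":"A Tree Grows in Brooklyn", "authors":["Betty Smith"],
   "year":"1943", "type":"Fiction"},
  {"title":"R For Dummies", "authors":["Joris Meys", "Andrie de Vries"],
   "year":"2014", "type":"Non Fiction"},
  {"title":"The Martian", "authors":["Andy Weir"],
   "year":"2011", "type":"Fiction"}
]}'

json_data <- fromJSON(json_str = json_text)

# Build a one-row data frame per book, collapsing authors to one string.
rows <- lapply(json_data$books, function(b) {
  data.frame(
    title   = b$title,
    authors = paste(unlist(b$authors), collapse = ", "),
    year    = as.integer(b$year),
    type    = b$type,
    stringsAsFactors = FALSE
  )
})

# Stack the rows into a single data frame with one book per row.
books_json_df <- do.call(rbind, rows)
```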

Data607_Week7

Tom Buonora

October 8, 2021
