In this week's assignment we will explore techniques for parsing HTML, XML, and JSON files and extracting their relevant content. We will first create one file per format containing information about a handful of books (three are required; the files used here contain four).
The elements captured are as follows: title, author(s), ISBN, publication year.
It is a requirement that at least one of the books has multiple authors.
For reproducibility of results, the 3 files (books.html, books.xml, and books.json) have been published to GitHub and are accessible via RawGit (see below).
Excerpt from the rawgit.com FAQ:
“RawGit acts as a caching proxy. It forwards requests to GitHub, caches the responses, and relays them to your browser with an appropriate Content-Type header based on the extension of the file that was requested. The caching layer ensures that minimal load is placed on GitHub, and you get quick and easy static file hosting right from a GitHub repo.”
We will be using the following packages:
RCurl
XML
jsonlite
## Loading required package: bitops
## Warning: package 'jsonlite' was built under R version 3.2.4
First we will retrieve the content of each of these files from its URL with getURL, and then parse that content.
#HTML file books.html
html_url <- getURL("https://cdn.rawgit.com/vbriot28/datascienceCUNY_607/master/books.html")
books_parsed_html <- htmlParse(html_url)
#XML file books.xml
xml_url <- getURL("https://cdn.rawgit.com/vbriot28/datascienceCUNY_607/master/books.xml")
books_parsed_xml <- xmlParse(xml_url)
#JSON file books.json
json_url <- getURL("https://cdn.rawgit.com/vbriot28/datascienceCUNY_607/master/books.json")
books_parsed_json <- fromJSON(json_url)
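The same three parsers can also be tried without network access by passing the document text directly. This is a minimal sketch, assuming small inline stand-ins for the three files: the `sample_*` strings and the `parsed_*` names are illustrative and are not the published files. `asText = TRUE` tells the XML package parsers that the argument is content rather than a file name or URL.

```r
library(XML)
library(jsonlite)

# Hypothetical stand-ins for books.html, books.xml and books.json
sample_html <- "<html><body><table><tr><th>Title</th><th>ISBN</th></tr><tr><td>R in a Nutshell</td><td>978-1-449-31208-4</td></tr></table></body></html>"
sample_xml  <- "<books><book><title>R in a Nutshell</title><author>Joseph Adler</author></book></books>"
sample_json <- '{"data_books":[{"title":"R in a Nutshell","ISBN":"978-1-449-31208-4"}]}'

# Parse each string with the parser matching its format
parsed_html <- htmlParse(sample_html, asText = TRUE)
parsed_xml  <- xmlParse(sample_xml, asText = TRUE)
parsed_json <- fromJSON(sample_json)

# Inspect what kind of object each parser returns
class(parsed_html)
class(parsed_xml)
class(parsed_json)
```

The two XML package parsers return pointers to internal documents that must be queried further, while `fromJSON` already returns ordinary R objects.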
We will display the structure of each object after parsing.
#HTML parsed results
str(books_parsed_html)
## Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
#XML parsed results
str(books_parsed_xml)
## Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
#JSON parsed results
str(books_parsed_json)
## List of 1
## $ data_books:'data.frame': 4 obs. of 4 variables:
## ..$ title : chr [1:4] "Data Architecture: A Primer for the Data Scientist Big Data, Data Warehouse and Data Vault" "Data Science from scratch: First Principals with Python" "R in a Nutshell" "R Graphics Cookbook"
## ..$ authors :List of 4
## .. ..$ : chr [1:2] "W. H. Inmon" "Daniel Linstedt"
## .. ..$ : chr "Joel Grus"
## .. ..$ : chr "Joseph Adler"
## .. ..$ : chr "Winston Chang"
## ..$ ISBN : chr [1:4] "978-0-12-802044-9" "978-1-491-90142-7" "978-1-449-31208-4" "978-1-449-31695-2"
## ..$ publishing_year: int [1:4] 2015 2015 2012 2013
As we observe, parsing the HTML and XML documents resulted in HTML and XML internal documents respectively, whereas parsing the JSON file resulted in a list containing a single data frame. The multiple authors of the first book, entered as an array in the JSON document, are stored as a character vector of length 2 inside the authors list column; they only appear as a single comma-separated entry when the data frame is printed. We will simply extract the element of the list and store it in a data frame, then display its structure and content.
#Construct Data frame for JSON file
df_books_json <- data.frame(books_parsed_json[1])
str(df_books_json)
## 'data.frame': 4 obs. of 4 variables:
## $ data_books.title : chr "Data Architecture: A Primer for the Data Scientist Big Data, Data Warehouse and Data Vault" "Data Science from scratch: First Principals with Python" "R in a Nutshell" "R Graphics Cookbook"
## $ data_books.authors :List of 4
## ..$ : chr "W. H. Inmon" "Daniel Linstedt"
## ..$ : chr "Joel Grus"
## ..$ : chr "Joseph Adler"
## ..$ : chr "Winston Chang"
## $ data_books.ISBN : chr "978-0-12-802044-9" "978-1-491-90142-7" "978-1-449-31208-4" "978-1-449-31695-2"
## $ data_books.publishing_year: int 2015 2015 2012 2013
head(df_books_json)
## data_books.title
## 1 Data Architecture: A Primer for the Data Scientist Big Data, Data Warehouse and Data Vault
## 2 Data Science from scratch: First Principals with Python
## 3 R in a Nutshell
## 4 R Graphics Cookbook
## data_books.authors data_books.ISBN
## 1 W. H. Inmon, Daniel Linstedt 978-0-12-802044-9
## 2 Joel Grus 978-1-491-90142-7
## 3 Joseph Adler 978-1-449-31208-4
## 4 Winston Chang 978-1-449-31695-2
## data_books.publishing_year
## 1 2015
## 2 2015
## 3 2012
## 4 2013
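Because the authors column is a list column, a plain character column with one string per book can be obtained by collapsing each element. This is a minimal sketch, assuming a small inline stand-in for books.json: the `sample_json` string, the `books` name, and the `authors_joined` column are illustrative. It also extracts the data frame with `[[1]]` rather than `data.frame(x[1])`, which avoids the `data_books.` prefix on the column names.

```r
library(jsonlite)

# Hypothetical two-book stand-in for books.json
sample_json <- '{"data_books":[
  {"title":"Data Architecture","authors":["W. H. Inmon","Daniel Linstedt"],"publishing_year":2015},
  {"title":"R in a Nutshell","authors":["Joseph Adler"],"publishing_year":2012}
]}'

# [[1]] returns the data frame itself, so column names keep no prefix
books <- fromJSON(sample_json)[[1]]

# Collapse each author vector into one comma-separated string
books$authors_joined <- vapply(books$authors, paste, character(1), collapse = ", ")

str(books)
```

Keeping the original list column alongside the collapsed one preserves the individual author names for later use.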
We will now extract the table from the parsed HTML document using a function from the XML package: readHTMLTable. Again, this function provides us with a list of a single data frame. We will then extract the element of the list and store it in a data frame, and display its structure and content.
## 'data.frame': 4 obs. of 4 variables:
## $ NULL.Title : Factor w/ 4 levels "Data Architecture: A Primer for the Data Scientist Big Data, Data Warehouse and Data Vault",..: 1 2 4 3
## $ NULL.Author.s. : Factor w/ 4 levels "Joel Grus","Joseph Adler",..: 3 1 2 4
## $ NULL.ISBN : Factor w/ 4 levels "978-0-12-802044-9",..: 1 4 2 3
## $ NULL.Publishing.Year: Factor w/ 3 levels "2012","2013",..: 3 3 1 2
## NULL.Title
## 1 Data Architecture: A Primer for the Data Scientist Big Data, Data Warehouse and Data Vault
## 2 Data Science from scratch: First Principals with Python
## 3 R in a Nutshell
## 4 R Graphics Cookbook
## NULL.Author.s. NULL.ISBN NULL.Publishing.Year
## 1 W. H. Inmon, Daniel Linstedt 978-0-12-802044-9 2015
## 2 Joel Grus 978-1-491-90142-7 2015
## 3 Joseph Adler 978-1-449-31208-4 2012
## 4 Winston Chang 978-1-449-31695-2 2013
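The code for the HTML step is not shown above. A minimal sketch of how it might look follows, assuming an inline stand-in for books.html: the `sample_html` string and the `tables` and `df_books_html` names are illustrative. Unlike the output above, it passes `stringsAsFactors = FALSE` to keep text columns as character, and it uses `[[1]]` so the column names carry no `NULL.` prefix.

```r
library(XML)

# Hypothetical two-book stand-in for books.html
sample_html <- "<html><body><table>
<tr><th>Title</th><th>Author(s)</th><th>ISBN</th><th>Publishing Year</th></tr>
<tr><td>Data Architecture</td><td>W. H. Inmon, Daniel Linstedt</td><td>978-0-12-802044-9</td><td>2015</td></tr>
<tr><td>R in a Nutshell</td><td>Joseph Adler</td><td>978-1-449-31208-4</td><td>2012</td></tr>
</table></body></html>"

parsed <- htmlParse(sample_html, asText = TRUE)

# readHTMLTable returns a list with one data frame per <table> in the document
tables <- readHTMLTable(parsed, stringsAsFactors = FALSE)

# Take the first (and only) table
df_books_html <- tables[[1]]
str(df_books_html)
```

All columns come back as text, so the year would still need an explicit conversion (for example with `as.integer`) before any numeric use.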
We are now going to extract the data from the parsed XML document. Since the first book has 2 author tags, we will extract each book node individually using xmlSApply on the root node. We will concatenate the 2 authors to follow the format of the previous extraction (i.e. “author1, author2”). Once we have a one-row data frame for each book, each with the same columns, we will combine them into one using the rbind function.
## 'data.frame': 4 obs. of 4 variables:
## $ title :List of 4
## ..$ NA : chr "Data Architecture: A Primer for the Data Scientist Big Data, Data Warehouse and Data Vault"
## ..$ title: chr "Data Science from scratch: First Principals with Python"
## ..$ title: chr "R in a Nutshell"
## ..$ title: chr "R Graphics Cookbook"
## $ author :List of 4
## ..$ NA : chr "W. H. Inmon, Daniel Linstedt"
## ..$ author: chr "Joel Grus"
## ..$ author: chr "Joseph Adler"
## ..$ author: chr "Winston Chang"
## $ ISBN :List of 4
## ..$ NA : chr "978-0-12-802044-9"
## ..$ ISBN: chr "978-1-491-90142-7"
## ..$ ISBN: chr "978-1-449-31208-4"
## ..$ ISBN: chr "978-1-449-31695-2"
## $ publishing_year:List of 4
## ..$ NA : chr "2015"
## ..$ publishing_year: chr "2015"
## ..$ publishing_year: chr "2012"
## ..$ publishing_year: chr "2013"
## title
## 1 Data Architecture: A Primer for the Data Scientist Big Data, Data Warehouse and Data Vault
## 2 Data Science from scratch: First Principals with Python
## 3 R in a Nutshell
## 4 R Graphics Cookbook
## author ISBN publishing_year
## 1 W. H. Inmon, Daniel Linstedt 978-0-12-802044-9 2015
## 2 Joel Grus 978-1-491-90142-7 2015
## 3 Joseph Adler 978-1-449-31208-4 2012
## 4 Winston Chang 978-1-449-31695-2 2013
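The code for the XML step is not shown above either. The following is a minimal sketch of the same idea, assuming an inline stand-in for books.xml whose tag names are taken from the column names in the output: the `sample_xml` string and the `field` and `book_to_row` helpers are hypothetical. It swaps xmlSApply on the root node for `getNodeSet` plus `lapply`, which selects only the book elements, and it builds plain character and integer columns rather than the list columns seen in the structure above.

```r
library(XML)

# Hypothetical two-book stand-in for books.xml
sample_xml <- "<books>
  <book>
    <title>Data Architecture</title>
    <author>W. H. Inmon</author>
    <author>Daniel Linstedt</author>
    <ISBN>978-0-12-802044-9</ISBN>
    <publishing_year>2015</publishing_year>
  </book>
  <book>
    <title>R in a Nutshell</title>
    <author>Joseph Adler</author>
    <ISBN>978-1-449-31208-4</ISBN>
    <publishing_year>2012</publishing_year>
  </book>
</books>"

parsed <- xmlParse(sample_xml, asText = TRUE)

# Values of all child elements with the given name (may be more than one)
field <- function(node, name) xpathSApply(node, paste0("./", name), xmlValue)

# Turn one <book> node into a one-row data frame, joining repeated authors
book_to_row <- function(book) {
  data.frame(
    title = field(book, "title"),
    author = paste(field(book, "author"), collapse = ", "),
    ISBN = field(book, "ISBN"),
    publishing_year = as.integer(field(book, "publishing_year")),
    stringsAsFactors = FALSE
  )
}

# One data frame per book, then stack them with rbind
rows <- lapply(getNodeSet(parsed, "//book"), book_to_row)
df_books_xml <- do.call(rbind, rows)
str(df_books_xml)
```

Collapsing the authors inside `book_to_row` guarantees that every row has the same four columns, which is what allows `rbind` to stack them.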
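R packages provide powerful tools for parsing and extracting Web content, but some analysis of the content is needed in order to select the best method or tool: here fromJSON returned a data frame directly, readHTMLTable handled the HTML table, and the XML document required node-by-node extraction because of the repeated author tags.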