This assignment creates three files (HTML, XML, and JSON) that describe three selected books of interest, then parses each file into an R data frame. Using attributes such as title and author, the books' information is stored in HTML (using an HTML table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”).
The goal of this assignment is to use R code and any R packages of choice to load the information from each of the three sources into separate R data frames.
This assignment requires the following R packages:
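Judging only from the functions called in the code that follows (getURL, htmlParse, xmlParse, ldply, fromJSON), a plausible package set is RCurl, XML, plyr, and jsonlite. This is an inference rather than a confirmed list, and other packages may have been used to render the tables. A minimal sketch that attaches whichever of these are installed:

```r
# Assumed package set, inferred from the functions used in this write-up
pkgs <- c("RCurl", "XML", "plyr", "jsonlite")

# Check which packages are available without attaching them yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach the packages that are available
invisible(lapply(pkgs[installed], library, character.only = TRUE))
```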
The code for this assignment can be found on GitHub here.
The three files, books.html, books.xml, and books.json, can be found on GitHub using the link below:
Load the books.html file from GitHub:
# HTML file location on GitHub
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.html"
txt <- getURL(url=baseURL)
Below is the structure of the books.html information in HTML format.
Parse the HTML table using htmlParse and XPath:
xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
# Solution adapted from Stack Overflow and modified for the books.html file:
# http://stackoverflow.com/questions/6427061/parsing-html-tables-using-the-xml-rcurl-r-packages-without-using-the-readhtml
# For each row, keep the values at odd positions 1, 3, ..., 13 (the seven table cells)
html_books <- as.data.frame(t(sapply(xmltable, function(x) unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7, 9, 11, 13)])))
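The hard-coded positions 1, 3, ..., 13 presumably skip whitespace text nodes that sit between the cells of each row. An alternative that avoids counting positions is to select the td elements directly with a relative XPath. The sketch below uses a small made-up table (not the real books.html) and hypothetical names such as demo_books:

```r
library(XML)

# Made-up stand-in for books.html; the real file has seven columns
html_txt <- "<table><tbody>
<tr><td>Title A</td> <td>Author A</td> <td>100</td></tr>
<tr><td>Title B</td> <td>Author B</td> <td>200</td></tr>
</tbody></table>"

doc  <- htmlParse(html_txt, asText = TRUE)
rows <- getNodeSet(doc, "//table//tr")

# For each row, take the text of its <td> children only
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))

# Bind the rows into a character matrix, then into a data frame
demo_books <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
```

Passing stringsAsFactors = FALSE keeps the columns as characters from the start, so no factor conversion is needed afterwards.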
The resulting data frame is shown below. It has no column names, and every variable is stored as a factor.
V1 | V2 | V3 | V4 | V5 | V6 | V7 |
---|---|---|---|---|---|---|
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English |
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English |
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English |
## 'data.frame': 3 obs. of 7 variables:
## $ V1: Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 3 2
## $ V2: Factor w/ 3 levels "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel",..: 3 1 2
## $ V3: Factor w/ 3 levels "Academic Press",..: 3 2 1
## $ V4: Factor w/ 3 levels "978-0124058880",..: 2 3 1
## $ V5: Factor w/ 3 levels "426","480","776": 2 1 3
## $ V6: Factor w/ 3 levels "Data Mining",..: 1 2 3
## $ V7: Factor w/ 1 level "English": 1 1 1
Apply column names and convert factor variables to characters:
colnames(html_books) <- c("Title", "Authors", "Publisher", "ISBN", "Pages", "Topic", "Language")
# convert the factors to characters
# http://stackoverflow.com/questions/27528907/how-to-convert-data-frame-column-from-factor-to-numeric
indx <- sapply(html_books, is.factor)
html_books[indx] <- lapply(html_books[indx], function(x) as.character(x))
html_books$Pages <- as.numeric(html_books$Pages)
Title | Authors | Publisher | ISBN | Pages | Topic | Language |
---|---|---|---|---|---|---|
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English |
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English |
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English |
Load the books.xml file from GitHub:
# XML file location on GitHub
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.xml"
txt <- getURL(url=baseURL)
Below is the structure of the books.xml information in XML format.
1. Parse the XML using xmlParse and ldply
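xml_books <- xmlParse(txt, validate = FALSE)
# http://www.informit.com/articles/article.aspx?p=2215520
# xmlToList() converts each book element to a list; ldply() binds the lists into one data frame
books1 <- ldply(xmlToList(txt), data.frame)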
str(books1)
## 'data.frame': 3 obs. of 9 variables:
## $ .id : chr "book" "book" "book"
## $ Title : Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 2 3
## $ Authors : Factor w/ 3 levels "Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis",..: 1 2 3
## $ Publisher: Factor w/ 3 levels "John Wiley and Sons, Ltd",..: 1 2 3
## $ ISBN : Factor w/ 3 levels "978-1118834817",..: 1 2 3
## $ Pages : Factor w/ 3 levels "480","426","776": 1 2 3
## $ Topic : Factor w/ 3 levels "Data Mining",..: 1 2 3
## $ Language : Factor w/ 1 level "English": 1 1 1
## $ .attrs : Factor w/ 3 levels "1","2","3": 1 2 3
This method of parsing the XML adds two fields to the resulting data frame: .id and .attrs. These correspond to the book elements in the XML file: .id holds the element name (book) and .attrs holds the element's id attribute.
Every variable in the resulting data frame (Title, Authors, Pages, etc.) is also stored as a factor. Convert the factors to character variables.
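The cleanup code for this step is not shown. A minimal sketch of what it might look like follows, using a made-up stand-in for books1 and assuming that .id is dropped and .attrs is renamed to Book.ID; the name books1_demo and its values are hypothetical:

```r
# Made-up stand-in with the same column layout as the ldply() result
books1_demo <- data.frame(
  .id    = c("book", "book"),
  Title  = factor(c("Title A", "Title B")),
  Pages  = factor(c("480", "426")),
  .attrs = factor(c("1", "2"))
)

# Convert every factor column to character
is_fac <- sapply(books1_demo, is.factor)
books1_demo[is_fac] <- lapply(books1_demo[is_fac], as.character)

# Pages is now character, so as.numeric() gives the real page counts
books1_demo$Pages <- as.numeric(books1_demo$Pages)

# Drop the element-name column and give the attribute column a clearer name
books1_demo$.id <- NULL
names(books1_demo)[names(books1_demo) == ".attrs"] <- "Book.ID"
```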
Title | Authors | Publisher | ISBN | Pages | Topic | Language | Book.ID |
---|---|---|---|---|---|---|---|
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English | 1 |
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English | 2 |
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English | 3 |
2. Parse the XML using xmlRoot and xmlToDataFrame
This option seems somewhat simpler, but xmlToDataFrame does not carry over the id attribute from the book element. As with option 1, every variable in the resulting data frame (Title, Authors, Pages, etc.) is stored as a factor. Convert the factors to character variables.
root <- xmlRoot(xml_books)
books2 <- xmlToDataFrame(root)
str(books2)
## 'data.frame': 3 obs. of 7 variables:
## $ Title : Factor w/ 3 levels "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining",..: 1 3 2
## $ Authors : Factor w/ 3 levels "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel",..: 3 1 2
## $ Publisher: Factor w/ 3 levels "Academic Press",..: 3 2 1
## $ ISBN : Factor w/ 3 levels "978-0124058880",..: 2 3 1
## $ Pages : Factor w/ 3 levels "426","480","776": 2 1 3
## $ Topic : Factor w/ 3 levels "Data Mining",..: 1 2 3
## $ Language : Factor w/ 1 level "English": 1 1 1
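Although xmlToDataFrame reads only element content, the id attribute can be collected separately and attached as an extra column. The sketch below assumes a root element named books and uses made-up values; only the book element and its id attribute come from the description above:

```r
library(XML)

# Made-up stand-in for books.xml; the root element name is an assumption
xml_txt <- paste0(
  '<books>',
  '<book id="1"><Title>Title A</Title><Pages>480</Pages></book>',
  '<book id="2"><Title>Title B</Title><Pages>426</Pages></book>',
  '</books>'
)

doc  <- xmlParse(xml_txt, asText = TRUE)
demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Pull the id attribute of every book element, in document order
demo$Book.ID <- xpathSApply(doc, "/books/book", xmlGetAttr, "id")

# as.character() guards against a factor column before the numeric conversion
demo$Pages <- as.numeric(as.character(demo$Pages))
```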
Title | Authors | Publisher | ISBN | Pages | Topic | Language |
---|---|---|---|---|---|---|
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English |
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English |
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English |
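One pitfall when converting a factor such as Pages to a number: calling as.numeric directly on a factor returns the internal level codes, not the values. With the levels "426", "480", "776", the page counts 480, 426, 776 would come back as 2, 1, 3. A short illustration, with pages as a hypothetical stand-in for the Pages column:

```r
# Factor with the same values as the Pages column; levels sort to 426, 480, 776
pages <- factor(c("480", "426", "776"))

# Direct conversion returns the level codes
codes <- as.numeric(pages)

# Converting to character first returns the actual page counts
values <- as.numeric(as.character(pages))

print(codes)   # 2 1 3
print(values)  # 480 426 776
```

Converting to character first, as was done for the HTML data frame, avoids the problem.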
Load the books.json file from GitHub:
# JSON file location on GitHub
baseURL <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/Week8/Data/books.json"
txt <- getURL(url=baseURL)
Below is the structure of the books.json information in JSON format.
Parse the JSON file using the jsonlite package.
json_books <- fromJSON(txt)
str(json_books)
## 'data.frame': 3 obs. of 7 variables:
## $ Title : chr "Automated Data Collection with R A practical Guide to Web Scraping and Text Mining" "OpenIntro Statistics Second Edition" "Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan"
## $ Authors : chr "Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis" "David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel" "John Kruschke"
## $ Publisher: chr "John Wiley and Sons, Ltd" "CreateSpace Independent Publishing Platform " "Academic Press"
## $ ISBN : chr "978-1118834817" "978-1478217206" "978-0124058880"
## $ Pages : chr "480" "426" "776"
## $ Topic : chr "Data Mining" "Probability / Statistics" "Statistics"
## $ Language : chr "English" "English" "English"
json_books$Pages <- as.numeric(json_books$Pages)
In this case, the variables were all loaded as characters instead of factors, so only Pages needed converting. Of the three approaches, jsonlite was the simplest and most straightforward.
The final dataframe looks like:
Title | Authors | Publisher | ISBN | Pages | Topic | Language |
---|---|---|---|---|---|---|
Automated Data Collection with R A practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter MeiBner, Dominic Nyhuis | John Wiley and Sons, Ltd | 978-1118834817 | 480 | Data Mining | English |
OpenIntro Statistics Second Edition | David M Diez, Christopher D Barr, Mine Cetinkaya-Rundel | CreateSpace Independent Publishing Platform | 978-1478217206 | 426 | Probability / Statistics | English |
Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan | John Kruschke | Academic Press | 978-0124058880 | 776 | Statistics | English |
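As a small illustration of why jsonlite returned a data frame of character columns: fromJSON simplifies a JSON array of objects into a data frame, and values written as quoted strings stay character. The sketch assumes books.json is a top-level array with quoted page counts, which is consistent with the str output above but not confirmed; the JSON text and the name demo_json are made up:

```r
library(jsonlite)

# Made-up stand-in for books.json: an array of objects with quoted numbers
json_txt <- '[
  {"Title": "Title A", "Pages": "480"},
  {"Title": "Title B", "Pages": "426"}
]'

# An array of objects is simplified to a data frame, one row per object
demo_json <- fromJSON(json_txt)
pages_class_before <- class(demo_json$Pages)

# Quoted numbers arrive as character, so convert explicitly
demo_json$Pages <- as.numeric(demo_json$Pages)
```

Had the page counts been stored as unquoted JSON numbers, the explicit conversion would not be needed.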