First load the required packages:
library(RCurl)
library(XML)
library(RJSONIO)
library(stringr)
library(plyr)
The data containing 3 books has the following attributes:
titleeditionauthorisbn_10priceamazon_linkThe HTML, XML, and JSON files were constructed and stored blow:
I used getURL to fetch and htmlParse to parse the html document.
url_html <- "https://raw.githubusercontent.com/Tyllis/Data607/master/books.html"
books.html <- htmlParse(getURL(url_html))
books.html
## <!DOCTYPE html>
## <html>
## <head><title>Favorite Books</title></head>
## <body>
## <table>
## <tr align="left">
## <th>Title</th> <th>Edition</th> <th>Author</th> <th align="right">ISBN_10</th> <th align="right">Price</th> <th align="right">Amazon_link</th> </tr>
## <tr align="left">
## <td>Civil Engineering Reference Manual for the PE Exam</td> <td>15</td> <td>Michael R. Lindeburge</td> <td align="right">1591265088</td> <td align="right">$260.00</td> <td align="right"><a href="https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me=">click me</a></td> </tr>
## <tr align="left">
## <td>Mechanics of Materials</td> <td>7</td> <td>Ferdinand P. Beer & E. Russell Johnston Jr. & John T. DeWolf & David F. Mazurek</td> <td align="right">0073398233</td> <td align="right">$107.12</td> <td align="right"><a href="https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508013128&sr=1-3&keywords=mechanics+of+materials">click me</a></td> </tr>
## <tr align="left">
## <td>Data Mining: Applications in Civil Engineering</td> <td>NA</td> <td>Sandro Saitta & Raphael Benny & Ian Smith</td> <td align="right">3639207564</td> <td align="right">$67.47</td> <td align="right"><a href="https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid=1508013381&sr=1-9&keywords=civil+engineering+machine+learning">click me</a></td> </tr>
## </table>
## </body>
## </html>
##
The attribute names are contained in the <th> nodes. This can be pulled out using xpathSApply:
col_name <- xpathSApply(doc = books.html, path = "//th", fun = xmlValue)
The values are contained in the <td> nodes. I used the numeric position predicates position() to pinpoint the values for each attribute.
For example, to pull out the book titles:
xpathSApply(doc = books.html, path = "//td[position()=1]", fun = xmlValue)
## [1] "Civil Engineering Reference Manual for the PE Exam"
## [2] "Mechanics of Materials"
## [3] "Data Mining: Applications in Civil Engineering"
I wrote a for-loop to extract values for each column, and added to the data.frame. Here, I also used the paste function to construct a path_string string to pass into the function:
books.data <- data.frame(c(1:3))
for (i in 1:length(col_name)){
path_string <- paste("//td[position()=", i, "]", sep = "")
books.data[,i] <- xpathSApply(doc = books.html, path = path_string, fun = xmlValue)
}
names(books.data) <- col_name
str(books.data)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Civil Engineering Reference Manual for the PE Exam" "Mechanics of Materials" "Data Mining: Applications in Civil Engineering"
## $ Edition : chr "15" "7" "NA"
## $ Author : chr "Michael R. Lindeburge" "Ferdinand P. Beer & E. Russell Johnston Jr. & John T. DeWolf & David F. Mazurek" "Sandro Saitta & Raphael Benny & Ian Smith"
## $ ISBN_10 : chr "1591265088" "0073398233" "3639207564"
## $ Price : chr "$260.00" "$107.12" "$67.47"
## $ Amazon_link: chr "click me" "click me" "click me"
Notice that the Amazon_link column contains just the string “click me”. This is because the values are actually contained in the attribute for the <a> note, intead of in the <td> node. I used the xmlGetAttr function to extract the links:
books.data$Amazon_link <- xpathSApply(doc = books.html, path = "//a", fun = xmlGetAttr, "href")
The extraction is now complete:
books.data
## Title Edition
## 1 Civil Engineering Reference Manual for the PE Exam 15
## 2 Mechanics of Materials 7
## 3 Data Mining: Applications in Civil Engineering NA
## Author
## 1 Michael R. Lindeburge
## 2 Ferdinand P. Beer & E. Russell Johnston Jr. & John T. DeWolf & David F. Mazurek
## 3 Sandro Saitta & Raphael Benny & Ian Smith
## ISBN_10 Price
## 1 1591265088 $260.00
## 2 0073398233 $107.12
## 3 3639207564 $67.47
## Amazon_link
## 1 https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me=
## 2 https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508013128&sr=1-3&keywords=mechanics+of+materials
## 3 https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid=1508013381&sr=1-9&keywords=civil+engineering+machine+learning
Again, I used getURL and xmlParse to parse the xml file.
url_xml <- "https://raw.githubusercontent.com/Tyllis/Data607/master/books.xml"
books.xml <- xmlParse(getURL(url_xml))
books.xml
## <?xml version="1.0" encoding="ISO-8859-1"?>
## <favorite_books>
## <book id="1">
## <title>Civil Engineering Reference Manual for the PE Exam</title>
## <edition>15</edition>
## <author>Michael R. Lindeburge</author>
## <isbn_10>1591265088</isbn_10>
## <price>$260.00</price>
## <amazon_link><![CDATA[https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me=]]></amazon_link>
## </book>
## <book id="2">
## <title>Mechanics of Materials</title>
## <edition>7</edition>
## <author><![CDATA[Ferdinand P. Beer & E. Russell Johnston Jr. & John T. DeWolf & David F. Mazurek]]></author>
## <isbn_10>0073398233</isbn_10>
## <price>$107.12</price>
## <amazon_link><![CDATA[https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508013128&sr=1-3&keywords=mechanics+of+materials]]></amazon_link>
## </book>
## <book id="3">
## <title>Data Mining: Applications in Civil Engineering</title>
## <edition>NA</edition>
## <author><![CDATA[Sandro Saitta & Raphael Benny & Ian Smith]]></author>
## <isbn_10>3639207564</isbn_10>
## <price>$67.47</price>
## <amazon_link><![CDATA[https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid=1508013381&sr=1-9&keywords=civil+engineering+machine+learning]]></amazon_link>
## </book>
## </favorite_books>
##
First, I attempted to pull out the column names using the xmlName function. Here, I just wanted to look at the child nodes under the <book id="1"> node. So I specified using the [@id] and child::\* predicates.
col_name <- xpathSApply(doc = books.xml, path = "//book[@id='1']/child::*", fun = xmlName)
Since XML files are well formed and structured, once column names are extracted, I simply just needed to run a for-loop to grab all the values for the columns. Again paste function was used, and I used col_name directly in the loop.
books.data <- data.frame(c(1:3))
for (i in col_name){
path_string <- paste("//", i, sep = "")
books.data[, i] <- xpathSApply(doc = books.xml, path = path_string, fun = xmlValue)
}
books.data
## c.1.3. title edition
## 1 1 Civil Engineering Reference Manual for the PE Exam 15
## 2 2 Mechanics of Materials 7
## 3 3 Data Mining: Applications in Civil Engineering NA
## author
## 1 Michael R. Lindeburge
## 2 Ferdinand P. Beer & E. Russell Johnston Jr. & John T. DeWolf & David F. Mazurek
## 3 Sandro Saitta & Raphael Benny & Ian Smith
## isbn_10 price
## 1 1591265088 $260.00
## 2 0073398233 $107.12
## 3 3639207564 $67.47
## amazon_link
## 1 https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me=
## 2 https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508013128&sr=1-3&keywords=mechanics+of+materials
## 3 https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid=1508013381&sr=1-9&keywords=civil+engineering+machine+learning
Before loading, I check if the link provides a valid JSON. I also used htmlParse to check out the structure of the JSON file.
url_json <- "https://raw.githubusercontent.com/Tyllis/Data607/master/books.json"
isValidJSON(url_json)
## [1] TRUE
htmlParse(getURL(url_json))
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><p>{"favorite books" :[
## {
## "title" : "Civil Engineering Reference Manual for the PE Exam",
## "edition" : 15,
## "author" : ["Michael R. Lindeburge"],
## "isbn_10" : "1591265088",
## "price" : "$260.00",
## "amazon_link" : "https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me="
## },
## {
## "title" : "Mechanics of Materials",
## "edition" : 7,
## "author" : ["Ferdinand P. Beer", "E. Russell Johnston Jr.", "John T. DeWolf", "David F. Mazurek"],
## "isbn_10" : "0073398233",
## "price" : "$107.12",
## "amazon_link" : "https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508013128&sr=1-3&keywords=mechanics+of+materials"
## },
## {
## "title" : "Data Mining: Applications in Civil Engineering",
## "edition" : "NA",
## "author" : ["Sandro Saitta", "Raphael Benny", "Ian Smith"],
## "isbn_10" : "3639207564",
## "price" : "$67.47",
## "amazon_link" : "https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid=1508013381&sr=1-9&keywords=civil+engineering+machine+learning"
## }]
## }</p></body></html>
##
I then used fromJSON to load the file:
books1 <- fromJSON(url_json)
str(books1)
## List of 1
## $ favorite books:List of 3
## ..$ :List of 6
## .. ..$ title : chr "Civil Engineering Reference Manual for the PE Exam"
## .. ..$ edition : num 15
## .. ..$ author : chr "Michael R. Lindeburge"
## .. ..$ isbn_10 : chr "1591265088"
## .. ..$ price : chr "$260.00"
## .. ..$ amazon_link: chr "https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me="
## ..$ :List of 6
## .. ..$ title : chr "Mechanics of Materials"
## .. ..$ edition : num 7
## .. ..$ author : chr [1:4] "Ferdinand P. Beer" "E. Russell Johnston Jr." "John T. DeWolf" "David F. Mazurek"
## .. ..$ isbn_10 : chr "0073398233"
## .. ..$ price : chr "$107.12"
## .. ..$ amazon_link: chr "https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508"| __truncated__
## ..$ :List of 6
## .. ..$ title : chr "Data Mining: Applications in Civil Engineering"
## .. ..$ edition : chr "NA"
## .. ..$ author : chr [1:3] "Sandro Saitta" "Raphael Benny" "Ian Smith"
## .. ..$ isbn_10 : chr "3639207564"
## .. ..$ price : chr "$67.47"
## .. ..$ amazon_link: chr "https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid="| __truncated__
It produced a list containing 3 lists, in which each list contains 6 lists.
I used sapply to unlist the list.
books2 <- sapply(books1[[1]], unlist)
str(books2)
## List of 3
## $ : Named chr [1:6] "Civil Engineering Reference Manual for the PE Exam" "15" "Michael R. Lindeburge" "1591265088" ...
## ..- attr(*, "names")= chr [1:6] "title" "edition" "author" "isbn_10" ...
## $ : Named chr [1:9] "Mechanics of Materials" "7" "Ferdinand P. Beer" "E. Russell Johnston Jr." ...
## ..- attr(*, "names")= chr [1:9] "title" "edition" "author1" "author2" ...
## $ : Named chr [1:8] "Data Mining: Applications in Civil Engineering" "NA" "Sandro Saitta" "Raphael Benny" ...
## ..- attr(*, "names")= chr [1:8] "title" "edition" "author1" "author2" ...
The result is a list containing 3 vectors.
Next, I used lapply to perform transpose on each vector using t function:
books3 <- lapply(books2, t)
str(books3)
## List of 3
## $ : chr [1, 1:6] "Civil Engineering Reference Manual for the PE Exam" "15" "Michael R. Lindeburge" "1591265088" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:6] "title" "edition" "author" "isbn_10" ...
## $ : chr [1, 1:9] "Mechanics of Materials" "7" "Ferdinand P. Beer" "E. Russell Johnston Jr." ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:9] "title" "edition" "author1" "author2" ...
## $ : chr [1, 1:8] "Data Mining: Applications in Civil Engineering" "NA" "Sandro Saitta" "Raphael Benny" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:8] "title" "edition" "author1" "author2" ...
The result is a list containing 3 matrices.
I then used lapply and pass function to create data.frame for each of the 3 matrices:
books4 <- lapply(books3, data.frame, stringsAsFactors = F)
str(books4)
## List of 3
## $ :'data.frame': 1 obs. of 6 variables:
## ..$ title : chr "Civil Engineering Reference Manual for the PE Exam"
## ..$ edition : chr "15"
## ..$ author : chr "Michael R. Lindeburge"
## ..$ isbn_10 : chr "1591265088"
## ..$ price : chr "$260.00"
## ..$ amazon_link: chr "https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me="
## $ :'data.frame': 1 obs. of 9 variables:
## ..$ title : chr "Mechanics of Materials"
## ..$ edition : chr "7"
## ..$ author1 : chr "Ferdinand P. Beer"
## ..$ author2 : chr "E. Russell Johnston Jr."
## ..$ author3 : chr "John T. DeWolf"
## ..$ author4 : chr "David F. Mazurek"
## ..$ isbn_10 : chr "0073398233"
## ..$ price : chr "$107.12"
## ..$ amazon_link: chr "https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508"| __truncated__
## $ :'data.frame': 1 obs. of 8 variables:
## ..$ title : chr "Data Mining: Applications in Civil Engineering"
## ..$ edition : chr "NA"
## ..$ author1 : chr "Sandro Saitta"
## ..$ author2 : chr "Raphael Benny"
## ..$ author3 : chr "Ian Smith"
## ..$ isbn_10 : chr "3639207564"
## ..$ price : chr "$67.47"
## ..$ amazon_link: chr "https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid="| __truncated__
The result is a list containg 3 data.frame.
Lastly, I used do.call to pass rbind.fill to combine the 3 data.frame into one final data.frame object.
booksjson.data <- do.call("rbind.fill", books4)
str(booksjson.data)
## 'data.frame': 3 obs. of 10 variables:
## $ title : chr "Civil Engineering Reference Manual for the PE Exam" "Mechanics of Materials" "Data Mining: Applications in Civil Engineering"
## $ edition : chr "15" "7" "NA"
## $ author : chr "Michael R. Lindeburge" NA NA
## $ isbn_10 : chr "1591265088" "0073398233" "3639207564"
## $ price : chr "$260.00" "$107.12" "$67.47"
## $ amazon_link: chr "https://www.amazon.com/Civil-Engineering-Reference-Manual-Exam/dp/1591265088/ref=mt_hardcover?_encoding=UTF8&me=" "https://www.amazon.com/Mechanics-Materials-7th-Ferdinand-Beer/dp/0073398233/ref=sr_1_3?s=books&ie=UTF8&qid=1508"| __truncated__ "https://www.amazon.com/Data-Mining-Applications-Civil-Engineering/dp/3639207564/ref=sr_1_9?s=books&ie=UTF8&qid="| __truncated__
## $ author1 : chr NA "Ferdinand P. Beer" "Sandro Saitta"
## $ author2 : chr NA "E. Russell Johnston Jr." "Raphael Benny"
## $ author3 : chr NA "John T. DeWolf" "Ian Smith"
## $ author4 : chr NA "David F. Mazurek" NA
Comparing the data.frame produced from the XML and HTML with one from the JSON, there is one major difference: the author attribute(s).
If a book has mutiple authors, the XML and HTML codes did not distinguish them into different nodes. Both codes used “&” to separate the authors and stored in one element, instead of as attributes. When parsing into DOM, the authors are treated as one string and loaded into one element in the column.
In JSON, authors are stored as a list element. When parsing, the fromJSON function automatically label the items in the list with numerical number, i.e. “author1”, “author2”, etc. As a result, the end result is a table listing authors in “wide” format. This is actually easiler to work with, as we can then use tidyr functions on the data.frame object right away.