Preparation:
Sample raw HTML, XML, and JSON files were first uploaded to GitHub. However, the default file links on GitHub point to the file "viewer" page rather than to the raw file, so a lot of time would go into extracting the data from the viewer HTML before the raw file could be digested. Therefore, I uploaded all files to a web server for this homework assignment. Azure is Microsoft's web hosting service.
library("httr")
library("XML")
library(methods)
HTML file: http://data607.azurewebsites.net/data/books.html
# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page
page <- GET(
"data607.azurewebsites.net/data/books.html",
config(cainfo = cafile)
)
x <- content(page, as='text')
## No encoding supplied: defaulting to UTF-8.
xmlfile <- xmlParse(x)
# Extract the root node from the XML file.
rootnode <- xmlRoot(xmlfile)
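For an HTML file, not every tag converts cleanly into a data frame column. For example, <li> tags are not converted into separate columns because they repeat: there is usually more than one <li> in each <ul> tag, so their text ends up merged into a single "ul" value.
# Convert the first child of the root node to a data frame.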
df <- xmlToDataFrame( rootnode[[1]])
names(df)
## [1] "h3" "ul"
nrow(df)
## [1] 3
Specific elements can be accessed by position, such as the second book's title from the <h3> tag and its attributes from the <ul> tag. More data extraction is required to work with HTML.
df[2,1]
## [1] Title: Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 3 Levels: Title: Empires of the Sea: The Siege of Malta, the Battle of Lepanto, and the Contest for the Center of the World ...
df[2,2]
## [1] Author: Stephen J. Dubner, Steven D. LevittSubject: Economics, Personal FianceRating: 4.5
## 3 Levels: Author: Daniel Patterson, Mandy AftelSubject: CookbooksRating: 4.5 ...
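The extra extraction mentioned above could be sketched with XPath, pulling each <li> separately instead of letting the text merge. This is only a minimal sketch: the inline `html_text` is a hypothetical stand-in for books.html (the real markup, including the wrapping <div>, is an assumption), and the column names are my own.

```r
library(XML)

# Hypothetical stand-in for books.html; the real file's markup may differ.
html_text <- '<html><body>
<div><h3>Title: Book A</h3><ul><li>Author: A. One, B. Two</li><li>Subject: Economics</li><li>Rating: 4.5</li></ul></div>
<div><h3>Title: Book B</h3><ul><li>Author: C. Three</li><li>Subject: History</li><li>Rating: 4.0</li></ul></div>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# Build one row per <div>, reading the <h3> and each <li> on its own
books <- do.call(rbind, lapply(getNodeSet(doc, "//div"), function(node) {
  li <- xpathSApply(node, "./ul/li", xmlValue)
  data.frame(
    title   = sub("^Title: ", "", xpathSApply(node, "./h3", xmlValue)),
    author  = sub("^Author: ", "", li[1]),
    subject = sub("^Subject: ", "", li[2]),
    rating  = as.numeric(sub("^Rating: ", "", li[3])),
    stringsAsFactors = FALSE
  )
}))
```

Reading the <li> items by position assumes every book lists author, subject, and rating in the same order.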
XML file: http://data607.azurewebsites.net/data/books.xml
# Read page
page2 <- GET(
"data607.azurewebsites.net/data/books.xml",
config(cainfo = cafile)
)
# Extract the response body as text
x2 <- content(page2, as='text')
## No encoding supplied: defaulting to UTF-8.
An XML document is easier to convert: its tags become the columns.
df2 <- xmlToDataFrame(x2)
names(df2)
## [1] "title" "author" "subject" "rating"
nrow(df2)
## [1] 3
For the title and every other element, we can access values directly with the usual data frame syntax. However, if a column holds more than one value per book, such as authors, the values are merged together with no separator.
print(df2[2,]$title)
## [1] Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## 3 Levels: Empires of the Sea: The Siege of Malta, the Battle of Lepanto, and the Contest for the Center of the World ...
print(df2[2,]$author)
## [1] Stephen J. DubnerSteven D. Levitt
## 3 Levels: Daniel PattersonMandy Aftel ... Stephen J. DubnerSteven D. Levitt
print(df2[2,]$subject)
## [1] EconomicsPersonal Fiance
## Levels: Cookbooks EconomicsPersonal Fiance HistoryReligion
print(df2[2,]$rating)
## [1] 4.5
## Levels: 4.5
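One possible way to avoid the merged values is to walk the book nodes with XPath and join repeated children with an explicit separator. The sketch below is not the method used above; the inline `xml_text` and its <book> and <name> tags are hypothetical, since the real structure of books.xml is only partly visible from the output.

```r
library(XML)

# Hypothetical stand-in for books.xml; tag names below are assumptions.
xml_text <- '<books>
  <book><title>Book A</title><author><name>A. One</name><name>B. Two</name></author><rating>4.5</rating></book>
  <book><title>Book B</title><author><name>C. Three</name></author><rating>4.0</rating></book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)

# One row per <book>; multiple authors are joined with "; " instead of merging
books <- do.call(rbind, lapply(getNodeSet(doc, "/books/book"), function(node) {
  data.frame(
    title  = xpathSApply(node, "./title", xmlValue),
    author = paste(xpathSApply(node, "./author/name", xmlValue), collapse = "; "),
    rating = as.numeric(xpathSApply(node, "./rating", xmlValue)),
    stringsAsFactors = FALSE
  )
}))
```

Setting stringsAsFactors = FALSE also keeps the text columns as plain character vectors rather than factors.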
JSON file: http://data607.azurewebsites.net/data/books.json
library(rjson)
# Read page
page3 <- GET(
"data607.azurewebsites.net/data/books.json",
config(cainfo = cafile)
)
# Extract the response body as text
x3 <- content(page3, as='text')
## No encoding supplied: defaulting to UTF-8.
x3_json <- fromJSON(x3)
df3 <- as.data.frame(x3_json)
With rjson, as.data.frame() spreads each element into its own column, and if a field holds more than one value, an extra row is created automatically. We need to be careful, though, because a JSON file can have a tree structure while a data frame is a flat table.
names(df3)
## [1] "books.title" "books.author" "books.subject"
## [4] "books.rating" "books.title.1" "books.author.1"
## [7] "books.subject.1" "books.rating.1" "books.title.2"
## [10] "books.author.2" "books.subject.2" "books.rating.2"
nrow(df3)
## [1] 2
With the rjson library, the file is converted, but the data frame is not expanded into one row per book: the books are spread across columns (books.title, books.title.1, books.title.2). For every attribute that has more than one value, such as author, the values in all the other columns are duplicated.
df3$books.title.1
## [1] Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## [2] Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
## Levels: Freakonomics: A Rogue Economist Explores the Hidden Side of Everything
df3$books.author.1
## [1] Stephen J. Dubner Steven D. Levitt
## Levels: Stephen J. Dubner Steven D. Levitt
df3$books.subject.1
## [1] Economics Personal Fiance
## Levels: Economics Personal Fiance
df3$books.rating.1
## [1] 4.5 4.5
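A data frame with one row per book could be built by looping over the parsed list instead of calling as.data.frame() on the whole tree. This is a rough sketch under assumptions: the inline `json_text` is a hypothetical stand-in for books.json, guessed from the column names above, and joining authors with "; " is my own choice.

```r
library(rjson)

# Hypothetical stand-in for books.json; the real file may be shaped differently.
json_text <- '{"books":[
  {"title":"Book A","author":["A. One","B. Two"],"rating":4.5},
  {"title":"Book B","author":"C. Three","rating":4.0}
]}'

parsed <- fromJSON(json_text)

# One row per book; fields with several values are collapsed into one string
books <- do.call(rbind, lapply(parsed$books, function(b) {
  data.frame(
    title  = b$title,
    author = paste(unlist(b$author), collapse = "; "),
    rating = b$rating,
    stringsAsFactors = FALSE
  )
}))
```

If the separate author values are needed later, a list column or a second authors table keyed by title would preserve them better than a collapsed string.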
Summary: All three formats show weaknesses when tree-structured data is converted to a data frame. More data extraction is required, or we may need to use different libraries or a different data structure to store the data.