Installing and loading the necessary packages
#install.packages("XML")
#install.packages("jsonlite")
#install.packages("RJSONIO")
#install.packages("RCurl")
#install.packages("httr")
#install.packages("plyr")
#install.packages("htmltab")
library(XML)
library(jsonlite)
library(RJSONIO)
##
## Attaching package: 'RJSONIO'
## The following objects are masked from 'package:jsonlite':
##
## fromJSON, toJSON
library(RCurl)
## Loading required package: bitops
library(httr)
library(plyr)
library(htmltab)
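The masking message above means that, once both JSON packages are attached, a bare fromJSON() call resolves to the RJSONIO version because RJSONIO was attached last. A minimal sketch of making the choice explicit with the :: operator follows; the JSON string is a made-up stand-in, not the contents of Books.json.

```r
# Illustrative JSON text (an assumption, not the contents of Books.json)
json_txt <- '{"books": [{"title": "Freakonomics", "authors": ["Steven D. Levitt", "Stephen J. Dubner"]}]}'

# Qualify each call with its package name so attach order no longer matters
from_jsonlite <- jsonlite::fromJSON(json_txt)
from_rjsonio  <- RJSONIO::fromJSON(json_txt)

# The two parsers return differently shaped objects for the same input
str(from_jsonlite)
str(from_rjsonio)
```

Qualifying the call this way keeps the script's behavior stable even if the order of the library() calls changes later.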
Load the HTML file into a data frame
url_html <- "https://raw.githubusercontent.com/humbertohpgit/MSDS1stSem/master/Books.html"
html_file <- getURL(url_html) ## RCurl package
html_df <- htmltab(doc = html_file)
## Argument 'which' was left unspecified. Choosing first table.
class(html_df)
## [1] "data.frame"
html_df
##                  Book Title                             Authors
## 2              Freakonomics Steven D. Levitt, Stephen J. Dubner
## 3           The Information                        James Gleick
## 4 Data Science for Business         Foster Provost, Tom Fawcett
##          Attribute1             Attribute2 Attribute3
## 2              Data              Economics  Analytics
## 3              Data Information Technology  Evolution
## 4 Data and Business      Analytic Thinking Innovation
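The message "Argument 'which' was left unspecified" appears because htmltab() was not told which table to extract. A short sketch that passes which = 1 explicitly is shown here; the inline HTML string is a hypothetical stand-in for Books.html so the example runs without a network connection.

```r
library(htmltab)

# Hypothetical two-row table standing in for Books.html
html_txt <- paste0(
  "<html><body><table>",
  "<tr><th>Book Title</th><th>Authors</th></tr>",
  "<tr><td>Freakonomics</td><td>Steven D. Levitt, Stephen J. Dubner</td></tr>",
  "<tr><td>The Information</td><td>James Gleick</td></tr>",
  "</table></body></html>"
)

# which = 1 selects the first table in the document explicitly
books <- htmltab(doc = html_txt, which = 1)
books
```

Naming the table index also guards against picking up the wrong table if the page later gains a second one.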
Load the XML file into a data frame
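url_xml <- "https://raw.githubusercontent.com/humbertohpgit/MSDS1stSem/master/Books.xml"
xml_file <- GET(url_xml) ## httr package
## Parse the response body as text rather than passing the response object itself
xml_data <- xmlParse(content(xml_file, as = "text", encoding = "UTF-8"), asText = TRUE)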
class(xml_data)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
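topxml <- xmlRoot(xml_data)
## Extract the text of every field of every record (fields in rows, records in columns)
topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
## Transpose so that each record becomes one row
xml_df <- data.frame(t(topxml), row.names = NULL)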
xml_df
##                       title                             authors
## 1              Freakonomics Steven D. Levitt, Stephen J. Dubner
## 2           The Information                        James Gleick
## 3 Data Science for Business         Foster Provost, Tom Fawcett
##          attribute1             attribute2 attribute3
## 1              Data              Economics  Analytics
## 2              Data Information Technology  Evolution
## 3 Data and Business      Analytic Thinking Innovation
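For flat records like these, the XML package also provides xmlToDataFrame(), which may be simpler than the nested xmlSApply() calls. The sketch below uses an inline document; the element names books and book are assumptions, since only the field names are visible in the output above.

```r
library(XML)

# Hypothetical records; root and record tag names are assumed
xml_txt <- paste0(
  "<books>",
  "<book><title>Freakonomics</title>",
  "<authors>Steven D. Levitt, Stephen J. Dubner</authors></book>",
  "<book><title>The Information</title>",
  "<authors>James Gleick</authors></book>",
  "</books>"
)

doc <- xmlParse(xml_txt, asText = TRUE)

# Each child of the root becomes one row, each grandchild one column
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
books
```

This shortcut only fits documents where every record has the same simple fields; nested or irregular XML still needs manual extraction.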
Load the JSON file into a data frame
url_json <- "https://raw.githubusercontent.com/humbertohpgit/MSDS1stSem/master/Books.json"
isValidJSON(url_json) ## RJSONIO package
## [1] TRUE
json_data <- RJSONIO::fromJSON(content = url_json)
class(json_data)
## [1] "list"
## Build one data frame per book, then stack them row-wise
json_df <- do.call("rbind", lapply(json_data[[1]], data.frame, stringsAsFactors = FALSE))
json_df
##                       title           authors        attribute1
## 1              Freakonomics  Steven D. Levitt              Data
## 2              Freakonomics Stephen J. Dubner              Data
## 3           The Information      James Gleick              Data
## 4 Data Science for Business    Foster Provost Data and Business
## 5 Data Science for Business       Tom Fawcett Data and Business
##               attribute2 attribute3
## 1              Economics  Analytics
## 2              Economics  Analytics
## 3 Information Technology  Evolution
## 4      Analytic Thinking Innovation
## 5      Analytic Thinking Innovation
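The repeated titles come from how data.frame() handles a record whose fields have different lengths: length-one fields are recycled to match the longest field. A minimal sketch with a hand-built record follows, assuming the parsed JSON stores authors as a character vector (the list is illustrative, not read from Books.json).

```r
# One parsed record, assumed shape: scalar fields plus a two-element authors vector
book <- list(
  title      = "Freakonomics",
  authors    = c("Steven D. Levitt", "Stephen J. Dubner"),
  attribute1 = "Data"
)

# data.frame() recycles the scalar fields, producing one row per author
book_df <- data.frame(book, stringsAsFactors = FALSE)
book_df
nrow(book_df)  # 2
```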
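The HTML and XML data frames hold the same values; they differ only in column names (Book Title, Authors, Attribute1 and so on versus title, authors, attribute1) and in row labels. The JSON data frame carries the same information, except that a book with more than one author is split into one row per author, so it has five rows instead of three.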
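If one row per book is wanted for the JSON result as well, the author rows can be collapsed back into a single comma-separated string. The sketch below uses base R's aggregate() on a hand-built copy of the title and authors columns; the name json_rows is hypothetical and the data is retyped from the printed output rather than downloaded.

```r
# Hand-built copy of two columns of the JSON result shown earlier
json_rows <- data.frame(
  title   = c("Freakonomics", "Freakonomics", "The Information",
              "Data Science for Business", "Data Science for Business"),
  authors = c("Steven D. Levitt", "Stephen J. Dubner", "James Gleick",
              "Foster Provost", "Tom Fawcett"),
  stringsAsFactors = FALSE
)

# Paste the authors of each title into one comma-separated string
collapsed <- aggregate(authors ~ title, data = json_rows,
                       FUN = function(a) paste(a, collapse = ", "))
collapsed
```

Note that aggregate() orders its result by the grouping column, so the rows may not come back in the original file order.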