Week7 Assignment - Web Data

Installing and Loading necessary packages

#install.packages("XML")
#install.packages("jsonlite")
#install.packages("RJSONIO")
#install.packages("RCurl")
#install.packages("httr")
#install.packages("plyr")
#install.packages("htmltab")
library(XML)
library(jsonlite)
library(RJSONIO)

## 
## Attaching package: 'RJSONIO'

## The following objects are masked from 'package:jsonlite':
## 
##     fromJSON, toJSON

library(RCurl)

## Loading required package: bitops

library(httr)
library(plyr)
library(htmltab)
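
The masking message above means that an unqualified fromJSON() or toJSON() call resolves to the RJSONIO version, because RJSONIO is attached after jsonlite. As a minimal sketch (the inline JSON string is a made-up stand-in, not the assignment file), a namespace prefix makes the choice explicit and shows that the two parsers return different shapes:

```r
# Made-up JSON text for illustration; not the Books.json file used in this report.
json_text <- '[{"title": "Freakonomics", "attribute1": "Data"}]'

# jsonlite simplifies an array of records into a data frame by default.
via_jsonlite <- jsonlite::fromJSON(json_text)

# RJSONIO returns a plain list for the same input; asText = TRUE marks the
# argument as JSON content rather than a file name or URL.
via_rjsonio <- RJSONIO::fromJSON(json_text, asText = TRUE)

class(via_jsonlite)
class(via_rjsonio)
```

Qualifying the call with the package name keeps the result independent of the order in which the libraries were attached.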

Load HTML file to a data frame

url_html <- c("https://raw.githubusercontent.com/humbertohpgit/MSDS1stSem/master/Books.html")
html_file <- getURL(url_html) ## RCurl package
html_df <- htmltab(doc = html_file)

## Argument 'which' was left unspecified. Choosing first table.

class(html_df)

## [1] "data.frame"

html_df

##                  Book Title                             Authors
## 2              Freakonomics Steven D. Levitt, Stephen J. Dubner
## 3           The Information                        James Gleick
## 4 Data Science for Business         Foster Provost, Tom Fawcett
##          Attribute1             Attribute2 Attribute3
## 2              Data              Economics  Analytics
## 3              Data Information Technology  Evolution
## 4 Data and Business      Analytic Thinking Innovation

Load XML file to a data frame

url_xml <- c("https://raw.githubusercontent.com/humbertohpgit/MSDS1stSem/master/Books.xml")
xml_file <- GET(url_xml) ## httr package
xml_data <- xmlParse(xml_file)
class(xml_data)

## [1] "XMLInternalDocument" "XMLAbstractDocument"

topxml <- xmlRoot(xml_data)
topxml <- xmlSApply(topxml, 
                    function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
                     row.names=NULL)
xml_df

##                       title                             authors
## 1              Freakonomics Steven D. Levitt, Stephen J. Dubner
## 2           The Information                        James Gleick
## 3 Data Science for Business         Foster Provost, Tom Fawcett
##          attribute1             attribute2 attribute3
## 1              Data              Economics  Analytics
## 2              Data Information Technology  Evolution
## 3 Data and Business      Analytic Thinking Innovation
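
The nested xmlSApply() call above can be hard to read at a glance. The sketch below repeats the same steps on a small inline document; the element names `books` and `book` and the two records are assumptions made for illustration, since the structure of Books.xml is not shown in this report.

```r
library(XML)

# Hypothetical inline XML, built without whitespace between tags so that
# no blank text nodes appear among the children.
xml_text <- paste0(
  "<books>",
  "<book><title>Freakonomics</title>",
  "<authors>Steven D. Levitt, Stephen J. Dubner</authors></book>",
  "<book><title>The Information</title>",
  "<authors>James Gleick</authors></book>",
  "</books>"
)

doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

# Inner call: one named character vector per book (field name -> text value).
# Outer call: binds those vectors into a matrix, fields in rows, books in columns.
fields <- xmlSApply(root, function(book) xmlSApply(book, xmlValue))

# Transpose so each book becomes a row, then convert to a data frame.
books_df <- data.frame(t(fields), row.names = NULL, stringsAsFactors = FALSE)
books_df
```

The transpose is needed because the outer xmlSApply() places each book in a column rather than a row.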

Load JSON file to a data frame

isValidJSON("https://raw.githubusercontent.com/humbertohpgit/MSDS1stSem/master/Books.json")

## [1] TRUE

json_data <- fromJSON(content = "https://raw.githubusercontent.com/humbertohpgit/MSDS1stSem/master/Books.json")
class(json_data)

## [1] "list"

json_df <- do.call("rbind", lapply(json_data[[1]], data.frame, stringsAsFactors = FALSE))
json_df

##                       title           authors        attribute1
## 1              Freakonomics  Steven D. Levitt              Data
## 2              Freakonomics Stephen J. Dubner              Data
## 3           The Information      James Gleick              Data
## 4 Data Science for Business    Foster Provost Data and Business
## 5 Data Science for Business       Tom Fawcett Data and Business
##               attribute2 attribute3
## 1              Economics  Analytics
## 2              Economics  Analytics
## 3 Information Technology  Evolution
## 4      Analytic Thinking Innovation
## 5      Analytic Thinking Innovation

The HTML and XML data frames contain the same values; they differ only in their column names (the HTML table keeps its header text, such as "Book Title") and row names. The JSON data frame is similar, except that a book with more than one author produces one row per author.
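
The extra rows in the JSON result come from data.frame() recycling the single-valued fields to match the length of the authors vector. The sketch below reproduces that effect in base R and shows one possible way to get one row per book instead. The `books` list is a hypothetical stand-in for json_data[[1]], and `one_row_per_book` is a helper introduced here for illustration only.

```r
# Hypothetical stand-in for json_data[[1]]: one list per book, with
# "authors" held as a character vector (an assumption about the parsed shape).
books <- list(
  list(title = "Freakonomics",
       authors = c("Steven D. Levitt", "Stephen J. Dubner"),
       attribute1 = "Data"),
  list(title = "The Information",
       authors = "James Gleick",
       attribute1 = "Data")
)

# Same pattern as in the report: length-one fields are recycled,
# so each author gets a separate row.
per_author <- do.call(rbind, lapply(books, data.frame, stringsAsFactors = FALSE))

# Illustrative helper: join the authors into one string before building the row.
one_row_per_book <- function(book) {
  book$authors <- paste(book$authors, collapse = ", ")
  data.frame(book, stringsAsFactors = FALSE)
}
per_book <- do.call(rbind, lapply(books, one_row_per_book))

nrow(per_author)
nrow(per_book)
```

Collapsing the authors with a comma and a space gives the same single-string form that appears in the HTML and XML data frames.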

humbertohp

October 10, 2018
