MSDS Spring 2018

DATA 607 Data Acquisition and Management

Jiadi Li

Week 7 Assignment: Working with XML and JSON in R

Create three files which store the books’ information in HTML (using an HTML table), XML, and JSON formats. (Uploaded on GitHub)

Write R code to load the information from each of the three sources into separate R data frames.

  1. Load packages
library(rjson)
library(RCurl)
## Loading required package: bitops
library(XML)
library(stringr)
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.4.4
## 
## Attaching package: 'jsonlite'
## The following objects are masked from 'package:rjson':
## 
##     fromJSON, toJSON
  2. HTML
#Import HTML file and read HTML table 
book.html.url <- getURL("https://raw.githubusercontent.com/xiaoxiaogao-DD/DATA607_Assignment7/master/books.html")
book.html.table <- readHTMLTable(book.html.url,header = TRUE)

#Convert HTML table into a data frame
book.html.dataframe <- as.data.frame(book.html.table)

#Adjust the column names: keep characters from position 6 onward to drop the prefix
colnames(book.html.dataframe) <- substring(colnames(book.html.dataframe), 6)

book.html.dataframe
##                                                        title
## 1                                   Python for Data Analysis
## 2 Hands-On Machine Learning with Scikit-Learn and TensorFlow
## 3                                         R for Data Science
##                ISBN                        editors price
## 1 978-1-449-31979-3 Julie Steele;Meghan Blanchette 39.99
## 2 978-1-491-96229-9                   Nicole Tache 49.99
## 3 978-1-491-91039-9 Marie Beaugureau;Mike Loukides 39.99
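
The column-name adjustment above removes a fixed five-character prefix by position. A minimal sketch of a pattern-based alternative follows. It assumes the prefix is the literal text `NULL.` (a five-character offset suggests this, but the raw names are not shown in this document), and the vector `raw.names` is a hypothetical stand-in for the column names before cleaning.

```r
# Hypothetical raw column names; the "NULL." prefix is an assumption.
raw.names <- c("NULL.title", "NULL.ISBN", "NULL.editors", "NULL.price")

# Positional approach, as in the code above: keep characters from position 6 on.
by.position <- substring(raw.names, 6)

# Pattern-based approach: remove the prefix only where it is actually present.
by.pattern <- sub("^NULL\\.", "", raw.names)

print(by.pattern)
print(identical(by.position, by.pattern))
```

The pattern-based version leaves a name untouched if it lacks the prefix, whereas the positional version would cut the first five characters of any name.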
  3. JSON
#Import JSON file
book.json.url <- getURL("https://raw.githubusercontent.com/xiaoxiaogao-DD/DATA607_Assignment7/master/books.json")

#Convert data in JSON into a data frame
book.json.dataframe <- flatten(as.data.frame(fromJSON(book.json.url)))

book.json.dataframe
##                                                        title
## 1                                   Python for Data Analysis
## 2 Hands-On Machine Learning with Scikit-Learn and TensorFlow
## 3                                         R for Data Science
##                ISBN                        editors price
## 1 978-1-449-31979-3 Julie Steele;Meghan Blanchette 39.99
## 2 978-1-491-96229-9                   Nicole Tache 49.99
## 3 978-1-491-91039-9 Marie Beaugureau;Mike Loukides 39.99
  4. XML
#Import XML file 
book.xml.url <- getURL("https://raw.githubusercontent.com/xiaoxiaogao-DD/DATA607_Assignment7/master/books.xml")

#Convert data in XML into a data frame
book.xml.dataframe <- xmlToDataFrame(xmlParse(book.xml.url))

book.xml.dataframe
##                                                        title
## 1                                   Python for Data Analysis
## 2 Hands-On Machine Learning with Scikit-Learn and TensorFlow
## 3                                         R for Data Science
##                isbn                        editors price
## 1 978-1-449-31979-3 Julie Steele;Meghan Blanchette 39.99
## 2 978-1-491-96229-9                   Nicole Tache 49.99
## 3 978-1-491-91039-9 Marie Beaugureau;Mike Loukides 39.99

Are the three data frames identical?
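Although the HTML, JSON, and XML files are structured differently, the three data frames are very similar after processing, especially those from HTML and XML. For the JSON data frame, serial (row) numbers are created and the data types are assigned automatically. One difference visible in the printed output is the ISBN column name, which is lower case (isbn) in the XML data frame and upper case (ISBN) in the other two.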
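
The question of whether the data frames are identical can also be checked programmatically rather than by eye. The sketch below is one possible way to do that in base R. It uses small inline stand-ins (`html.df`, `xml.df`, `json.df`) so that it runs without network access, and `compare.frames` is a hypothetical helper, not part of the original analysis. The column types are assumptions: in particular, treating the JSON price as numeric is a guess at what "assigned automatically" means here.

```r
# Stand-in data built inline; values are taken from the printed tables above,
# but the column types are assumptions.
books <- data.frame(
  title = c("Python for Data Analysis", "R for Data Science"),
  ISBN  = c("978-1-449-31979-3", "978-1-491-91039-9"),
  price = c("39.99", "39.99"),
  stringsAsFactors = FALSE
)

html.df <- books

xml.df <- books
names(xml.df)[names(xml.df) == "ISBN"] <- "isbn"  # lower-case name, as printed for XML

json.df <- books
json.df$price <- as.numeric(json.df$price)        # assumption: price parsed as a number

# Hypothetical helper: report where two data frames agree or differ.
compare.frames <- function(a, b) {
  list(
    same.names   = identical(names(a), names(b)),
    same.classes = identical(unname(sapply(a, class)), unname(sapply(b, class))),
    identical    = identical(a, b)
  )
}

str(compare.frames(html.df, xml.df))
str(compare.frames(html.df, json.df))
```

Applied to the real data, the same helper could be called on `book.html.dataframe`, `book.xml.dataframe`, and `book.json.dataframe` to see whether any difference lies in the column names, the column classes, or the values.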