Task

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Load needed libraries

rm(list = ls())
library(RCurl)

## Warning: package 'RCurl' was built under R version 3.3.2

## Loading required package: bitops

library(XML)

## Warning: package 'XML' was built under R version 3.3.2

library(jsonlite)

## Warning: package 'jsonlite' was built under R version 3.3.2

HTML

HTML_Books <- "https://raw.githubusercontent.com/zachdravis/CUNY-DATA-607/master/Assignment%20Week%207/books.html" # URL of the HTML file
HTML_Books <- getURLContent(HTML_Books) # Download the HTML content
HTML_Books <- readHTMLTable(HTML_Books) # Parse the tables; returns a list
HTML_Books <- HTML_Books[[1]] # Extract the first table from the list
HTML_Books <- as.data.frame(HTML_Books) # Ensure the result is a data frame

XML

XML_Books <- "https://raw.githubusercontent.com/zachdravis/CUNY-DATA-607/master/Assignment%20Week%207/books.xml" #Set URL as object
XML_Books <- getURLContent(XML_Books) #Get the XML content
XML_Books <- xmlToDataFrame(XML_Books) #Convert it to a dataframe

JSON

JSON_Books <- "https://raw.githubusercontent.com/zachdravis/CUNY-DATA-607/master/Assignment%20Week%207/books.json" # URL of the JSON file
JSON_Books <- fromJSON(JSON_Books) # Download and parse; returns a list
JSON_Books <- JSON_Books[[1]] # Extract the first element of the list
JSON_Books <- as.data.frame(JSON_Books) # Convert to a data frame

Examine Differences

View(HTML_Books)
View(XML_Books)
View(JSON_Books)

Viewed side by side, the three data frames look very similar. Let’s investigate a bit further with str().

str(HTML_Books)

## 'data.frame':    3 obs. of  4 variables:
##  $ Book Name            : Factor w/ 3 levels "Automated Data Collection with R",..: 1 2 3
##  $ Book Publication Year: Factor w/ 2 levels "2015","2017": 1 1 2
##  $ Book Publisher       : Factor w/ 3 levels "O'Reilly","OpenIntro",..: 3 2 1
##  $ Book authors         : Factor w/ 3 levels "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel",..: 3 1 2

str(XML_Books)

## 'data.frame':    3 obs. of  4 variables:
##  $ Book_Name            : Factor w/ 3 levels "Automated Data Collection with R",..: 1 2 3
##  $ Book_Publication_Year: Factor w/ 2 levels "2015","2017": 1 1 2
##  $ Book_Publisher       : Factor w/ 3 levels "OpenIntro","OReilly",..: 3 1 2
##  $ Book_Authors         : Factor w/ 3 levels "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel",..: 3 1 2

str(JSON_Books)

## 'data.frame':    3 obs. of  4 variables:
##  $ Book Name            : chr  "Automated Data Collection with R" "OpenIntro Statistics" "R for Data Science"
##  $ Book Publication Year: chr  "2015" "2015" "2017"
##  $ Book Publisher       : chr  "Wiley" "OpenIntro" "O'Reilly"
##  $ Book author(s)       :List of 3
##   ..$ : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
##   ..$ : chr "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel"
##   ..$ : chr "Hadley Wickham, Garrett Grolemund"

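The HTML and XML tables have the same structure (three observations of four factor variables), but the str() output shows they are not identical: the column names differ (spaces versus underscores, and "Book authors" versus "Book_Authors"), and the publisher appears as "O'Reilly" in the HTML table but as "OReilly" in the XML table. The JSON table differs further. It stores the authors column (the comma-separated, multi-value field) as a list and names it "Book author(s)", and it stores the remaining columns as character vectors, whereas the HTML and XML tables store them as factors. So the three data frames are not identical.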
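The comparison above can also be checked programmatically with identical(). The following is a minimal sketch in base R, not the code used in this assignment: html_like, xml_like, and json_like are hypothetical two-row stand-ins built by hand so the example runs without a network connection, and normalize() is a hypothetical helper that converts every column to character and unifies spaces and underscores in the column names.

```r
# Hypothetical two-row stand-ins for the loaded tables (built by hand)
html_like <- data.frame(
  `Book Name` = factor(c("OpenIntro Statistics", "R for Data Science")),
  `Book Publisher` = factor(c("OpenIntro", "O'Reilly")),
  check.names = FALSE
)
xml_like <- data.frame(
  Book_Name = factor(c("OpenIntro Statistics", "R for Data Science")),
  Book_Publisher = factor(c("OpenIntro", "OReilly"))
)
json_like <- data.frame(
  `Book Name` = c("OpenIntro Statistics", "R for Data Science"),
  `Book Publisher` = c("OpenIntro", "O'Reilly"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Raw comparisons: names, column types, and values all have to match
raw_html_vs_xml <- identical(html_like, xml_like)
raw_html_vs_json <- identical(html_like, json_like)

# Hypothetical helper: character columns, underscores in column names
normalize <- function(df) {
  df[] <- lapply(df, as.character)
  names(df) <- gsub("[ _]+", "_", names(df))
  df
}

# After normalizing, only genuine differences in the values remain
norm_html_vs_json <- identical(normalize(html_like), normalize(json_like))
norm_html_vs_xml <- identical(normalize(html_like), normalize(xml_like))
```

In this sketch both raw comparisons are FALSE. After normalization the HTML-like and JSON-like tables match, while the HTML-like and XML-like tables still differ because of the "O'Reilly" versus "OReilly" spelling. Applying the same idea to the real tables would also require reconciling the differently named authors column.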

JSON_Books[1,3]

## [1] "Wiley"

JSON_Books[1,4]

## [[1]]
## [1] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
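The [[1]] in the output above confirms that each cell of the JSON authors column is a list element rather than a plain string. One possible way to flatten such a column is sketched below in base R. The json_like data frame is a hypothetical stand-in built by hand, and the sketch assumes every list element is a character vector.

```r
# Hypothetical stand-in for the JSON table, with a list column for authors
json_like <- data.frame(
  `Book Name` = c("OpenIntro Statistics", "R for Data Science"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
json_like[["Book author(s)"]] <- list(
  "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel",
  "Hadley Wickham, Garrett Grolemund"
)

# Before: the authors column is a list
was_list <- is.list(json_like[["Book author(s)"]])

# Collapse each list element into a single string (assumes character elements)
json_like[["Book author(s)"]] <- vapply(
  json_like[["Book author(s)"]],
  paste,
  character(1),
  collapse = ", "
)

# After: the authors column is an ordinary character vector
now_character <- is.character(json_like[["Book author(s)"]])
```

Using vapply() with character(1) makes the sketch fail loudly if any element does not collapse to a single string, which is safer than unlist() when elements may have different lengths.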

DATA607 Assignment 7

Zach Dravis

5/2/2018
