Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
rm(list = ls())
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.3.2
## Loading required package: bitops
library(XML)
## Warning: package 'XML' was built under R version 3.3.2
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.3.2
HTML_Books <- "https://raw.githubusercontent.com/zachdravis/CUNY-DATA-607/master/Assignment%20Week%207/books.html" #Set URL as object
HTML_Books <- getURLContent(HTML_Books) #Get the html content
HTML_Books <- readHTMLTable(HTML_Books) #Read HTML table
HTML_Books <- HTML_Books[[1]] #Remove from List
HTML_Books <- as.data.frame(HTML_Books) #Create data frame
XML_Books <- "https://raw.githubusercontent.com/zachdravis/CUNY-DATA-607/master/Assignment%20Week%207/books.xml" #Set URL as object
XML_Books <- getURLContent(XML_Books) #Get the XML content
XML_Books <- xmlToDataFrame(XML_Books) #Convert it to a dataframe
JSON_Books <- "https://raw.githubusercontent.com/zachdravis/CUNY-DATA-607/master/Assignment%20Week%207/books.json"
JSON_Books <- fromJSON(JSON_Books) #Converts to a list
JSON_Books <- JSON_Books[[1]] #Index the list
JSON_Books <- as.data.frame(JSON_Books) #Convert to Data frame
View(HTML_Books)
View(XML_Books)
View(JSON_Books)
At a glance, the three data frames look very similar. Let’s investigate a bit further.
str(HTML_Books)
## 'data.frame': 3 obs. of 4 variables:
## $ Book Name : Factor w/ 3 levels "Automated Data Collection with R",..: 1 2 3
## $ Book Publication Year: Factor w/ 2 levels "2015","2017": 1 1 2
## $ Book Publisher : Factor w/ 3 levels "O'Reilly","OpenIntro",..: 3 2 1
## $ Book authors : Factor w/ 3 levels "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel",..: 3 1 2
str(XML_Books)
## 'data.frame': 3 obs. of 4 variables:
## $ Book_Name : Factor w/ 3 levels "Automated Data Collection with R",..: 1 2 3
## $ Book_Publication_Year: Factor w/ 2 levels "2015","2017": 1 1 2
## $ Book_Publisher : Factor w/ 3 levels "OpenIntro","OReilly",..: 3 1 2
## $ Book_Authors : Factor w/ 3 levels "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel",..: 3 1 2
str(JSON_Books)
## 'data.frame': 3 obs. of 4 variables:
## $ Book Name : chr "Automated Data Collection with R" "OpenIntro Statistics" "R for Data Science"
## $ Book Publication Year: chr "2015" "2015" "2017"
## $ Book Publisher : chr "Wiley" "OpenIntro" "O'Reilly"
## $ Book author(s) :List of 3
## ..$ : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
## ..$ : chr "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel"
## ..$ : chr "Hadley Wickham, Garrett Grolemund"
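The HTML and XML tables share the same structure (three rows, four factor columns), but they are not identical: the XML column names use underscores instead of spaces, and the XML publisher reads "OReilly" rather than "O'Reilly", which also changes the order of the factor levels. The JSON table differs further: its author column is named "Book author(s)" and is stored as a list, and its other columns are stored as characters, whereas in the XML and HTML tables they are stored as factors.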
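A visual read of the str() output can be backed up with a programmatic check. The sketch below is only illustrative: html_demo and xml_demo are hypothetical hand-built stand-ins that copy one row of values and the column names from the str() output above, so it runs without downloading anything. It compares the frames with identical(), then normalises the column names and converts factors to characters to see which differences remain.

```r
# Hypothetical stand-ins for one row of the HTML and XML tables,
# copied from the str() output above (not the downloaded data).
html_demo <- data.frame(
  "Book Name" = "R for Data Science",
  "Book Publisher" = "O'Reilly",
  check.names = FALSE,      # keep the spaces in the column names
  stringsAsFactors = TRUE   # mimic the factor columns shown by str()
)
xml_demo <- data.frame(
  Book_Name = "R for Data Science",
  Book_Publisher = "OReilly",
  stringsAsFactors = TRUE
)

# Strict comparison of the two frames as they are
frames_identical <- identical(html_demo, xml_demo)

# Normalise: underscores in names, plain character columns
names(html_demo) <- gsub(" ", "_", names(html_demo), fixed = TRUE)
html_demo[] <- lapply(html_demo, as.character)
xml_demo[] <- lapply(xml_demo, as.character)

# After normalising, the names agree; list the columns whose values still differ
names_match <- identical(names(html_demo), names(xml_demo))
differing <- names(html_demo)[!mapply(identical, html_demo, xml_demo)]

print(frames_identical)
print(names_match)
print(differing)
```

On the real data, the same steps could be applied to HTML_Books and XML_Books. The author column would need its own renaming, since "Book authors" and "Book_Authors" differ in capitalisation as well as in the separator.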
JSON_Books[1,3]
## [1] "Wiley"
JSON_Books[1,4]
## [[1]]
## [1] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
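The [[1]] in the last result shows that each cell of the author column is itself a list element rather than a plain string. If a plain character column is preferred, one option is to collapse each element to a single string. This is a minimal sketch on a hypothetical stand-in, json_demo, built by hand from two of the rows shown above. It is not the frame returned by fromJSON, and the shortened column names are assumptions for illustration.

```r
# Hypothetical stand-in for the JSON table, with the authors held in a list column
json_demo <- data.frame(
  Publisher = c("OpenIntro", "O'Reilly"),
  stringsAsFactors = FALSE
)
json_demo$Authors <- list(
  "David M Diez, Christopher D Barr, Mina Cetinkaya-Rundel",
  "Hadley Wickham, Garrett Grolemund"
)
was_list <- is.list(json_demo$Authors)

# Collapse each list element to one string; vapply guarantees a character vector
json_demo$Authors <- vapply(
  json_demo$Authors,
  function(x) paste(x, collapse = ", "),
  character(1)
)

# Indexing a cell now returns a plain string instead of a one-element list
first_authors <- json_demo[1, "Authors"]
print(first_authors)
```

Using paste() with collapse keeps the sketch safe even if an element held several author strings. For elements that are always single strings, unlist() would give the same result.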