Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books3.xml”, and “books3.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Links to the books.html, books3.xml, and books3.json files:
books.html: https://raw.githubusercontent.com/jcp9010/MSDA/master/books.html
books3.xml: https://raw.githubusercontent.com/jcp9010/MSDA/master/books3.xml
books3.json: https://raw.githubusercontent.com/jcp9010/MSDA/master/books3.json
The raw GitHub files were served through http://www.rawgit.com so that they are delivered with proper Content-Type headers.
Load libraries:
library(RCurl)
library(XML)
library(jsonlite)
library(data.table)
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames.
# Assigning html, xml, json addresses to local variables
html.url <- "http://cdn.rawgit.com/jcp9010/MSDA/master/books.html"
xml.url <- "http://cdn.rawgit.com/jcp9010/MSDA/master/books3.xml"
json.url <- "http://cdn.rawgit.com/jcp9010/MSDA/master/books3.json"
# Parsing the data
comic.html <- htmlParse(file = html.url)
comic.xml <- xmlParse(xml.url)
comic.json <- fromJSON(json.url)
# Converting a parsed HTML file into a data.frame
comic.html.df <- as.data.frame(readHTMLTable(comic.html))
comic.html.df
## NULL.Book.Name
## 1 The Complete Calvin and Hobbes
## 2 The Walking Dead Compendium Vol. 1
## 3 Peanuts 2000: The 50th Year of the World's Favorite Comic Strip
## NULL.Author
## 1 Bill Watterson
## 2 Robert Kirkman, Charlie Adlard, and Cliff Rathburn
## 3 Charles M. Schulz
## NULL.Publisher NULL.Year.Published NULL.Pages NULL.Language
## 1 Andrews McMeel Publishing 2012 1456 English
## 2 Skybound 2016 1088 English
## 3 Ballantine Books 2000 176 English
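The "NULL." prefix in the column names above most likely arises because readHTMLTable() returns a list of tables, and a table without an id attribute appears to be stored under the name "NULL"; as.data.frame() then prepends that list name to each column. A minimal sketch of one way to avoid this is to request a single table with which = 1. The inline HTML below is a hypothetical stand-in for books.html, not the original file, and the object names are illustrative.

```r
library(XML)

# Hypothetical inline HTML standing in for books.html (not the original file)
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Book Name</th><th>Author</th><th>Pages</th></tr>",
  "<tr><td>The Complete Calvin and Hobbes</td><td>Bill Watterson</td><td>1456</td></tr>",
  "<tr><td>Peanuts 2000</td><td>Charles M. Schulz</td><td>176</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table directly as a data frame,
# rather than a list of tables that as.data.frame() has to flatten
books_html <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

names(books_html)
str(books_html)
```

With stringsAsFactors = FALSE the text columns stay as character vectors, which makes later comparisons with the JSON data simpler.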
# Converting a parsed XML file into a data.frame
root <- xmlRoot(comic.xml)
comic.xml.df <- xmlToDataFrame(root)
comic.xml.df
## Book_Name
## 1 The Complete Calvin and Hobbes
## 2 The Walking Dead Compendium Vol. 1
## 3 Peanuts 2000: The 50th Year of the World's Favorite Comic Strip
## Author Publisher
## 1 Bill Watterson Andrews McMeel Publishing
## 2 Robert Kirkman, Charlie Adlard, Cliff Rathburn Skybound
## 3 Charles M. Schulz Ballantine Books
## Year_Published Pages Language
## 1 2012 1456 English
## 2 2016 1088 English
## 3 2000 176 English
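XML carries no type information, so every value arrives as text. The sketch below shows one possible way to keep strings as character vectors and convert the numeric columns explicitly. The inline XML is a hypothetical, shortened stand-in for books3.xml, and the element names are assumed from the column names printed above.

```r
library(XML)

# Hypothetical inline XML standing in for books3.xml (not the original file)
xml_text <- paste0(
  "<books>",
  "<book><Book_Name>The Complete Calvin and Hobbes</Book_Name>",
  "<Year_Published>2012</Year_Published><Pages>1456</Pages></book>",
  "<book><Book_Name>Peanuts 2000</Book_Name>",
  "<Year_Published>2000</Year_Published><Pages>176</Pages></book>",
  "</books>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# Keep text columns as character instead of factor
books_xml <- xmlToDataFrame(xmlRoot(doc), stringsAsFactors = FALSE)

# Every XML value is read as text, so numeric columns are converted by hand
books_xml$Year_Published <- as.integer(books_xml$Year_Published)
books_xml$Pages <- as.integer(books_xml$Pages)

str(books_xml)
```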
# Converting comic.json (list) into a data.frame
comic.json.df <- as.data.frame(comic.json)
comic.json.df
## books.Book_Name
## 1 Calvin and Hobbes
## 2 The Walking Dead Compendium Vol. 1
## 3 Peanuts 2000: The 50th Year of the World's Favorite Comic Strip
## books.Author books.Publisher
## 1 Bill Watterson Andrews McMeel Publishing
## 2 Robert Kirkman, Charlie Adlard, Cliff Rathburn Skybound
## 3 Charles M. Schulz Ballantine Books
## books.Year books.Pages books.Language
## 1 2012 1456 English
## 2 2016 1088 English
## 3 2000 176 English
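The "books." prefix above comes from calling as.data.frame() on the whole parsed list, whose single element is named books. A minimal sketch of an alternative is to extract that element directly, since fromJSON() already simplifies an array of objects into a data frame. The inline JSON is a hypothetical, shortened stand-in for books3.json, with key names assumed from the output above.

```r
library(jsonlite)

# Hypothetical inline JSON standing in for books3.json (not the original file)
json_text <- '{"books": [
  {"Book_Name": "Calvin and Hobbes", "Year": 2012, "Pages": 1456},
  {"Book_Name": "Peanuts 2000", "Year": 2000, "Pages": 176}
]}'

parsed <- fromJSON(json_text)

# The "books" element is already a data frame; taking it directly
# keeps the original key names as column names
books_json <- parsed$books

names(books_json)
str(books_json)
```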
Are the three data frames identical?
We can compare the structure of each data frame to see whether they are identical.
str(comic.html.df)
## 'data.frame': 3 obs. of 6 variables:
## $ NULL.Book.Name : Factor w/ 3 levels "Peanuts 2000: The 50th Year of the World's Favorite Comic Strip",..: 2 3 1
## $ NULL.Author : Factor w/ 3 levels "Bill Watterson",..: 1 3 2
## $ NULL.Publisher : Factor w/ 3 levels "Andrews McMeel Publishing",..: 1 3 2
## $ NULL.Year.Published: Factor w/ 3 levels "2000","2012",..: 2 3 1
## $ NULL.Pages : Factor w/ 3 levels "1088","1456",..: 2 1 3
## $ NULL.Language : Factor w/ 1 level "English": 1 1 1
str(comic.xml.df)
## 'data.frame': 3 obs. of 6 variables:
## $ Book_Name : Factor w/ 3 levels "Peanuts 2000: The 50th Year of the World's Favorite Comic Strip",..: 2 3 1
## $ Author : Factor w/ 3 levels "Bill Watterson",..: 1 3 2
## $ Publisher : Factor w/ 3 levels "Andrews McMeel Publishing",..: 1 3 2
## $ Year_Published: Factor w/ 3 levels "2000","2012",..: 2 3 1
## $ Pages : Factor w/ 3 levels "1088","1456",..: 2 1 3
## $ Language : Factor w/ 1 level "English": 1 1 1
str(comic.json.df)
## 'data.frame': 3 obs. of 6 variables:
## $ books.Book_Name: chr "Calvin and Hobbes" "The Walking Dead Compendium Vol. 1" "Peanuts 2000: The 50th Year of the World's Favorite Comic Strip"
## $ books.Author : chr "Bill Watterson" "Robert Kirkman, Charlie Adlard, Cliff Rathburn" "Charles M. Schulz"
## $ books.Publisher: chr "Andrews McMeel Publishing" "Skybound" "Ballantine Books"
## $ books.Year : int 2012 2016 2000
## $ books.Pages : int 1456 1088 176
## $ books.Language : chr "English" "English" "English"
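The HTML and XML data frames have the same structure: every column is a factor. Their column names differ (NULL.Book.Name vs. Book_Name), and the printed output also shows a small content difference, since the HTML author entry for The Walking Dead reads "Robert Kirkman, Charlie Adlard, and Cliff Rathburn" while the XML entry omits the "and". The JSON data frame differs more: its text columns are chr, Year and Pages are int, the year column is named books.Year rather than Year_Published, and the first title is "Calvin and Hobbes" rather than "The Complete Calvin and Hobbes". The three data frames are therefore not identical.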
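The comparison can also be made programmatically once column names and types are harmonised. The sketch below uses base R only. The helper normalise_books() and the two small data frames are hypothetical and merely mimic the XML (factor) and JSON (chr/int) results above.

```r
# Hypothetical helper: harmonise a data frame so that only content is compared
normalise_books <- function(df, col_names) {
  df[] <- lapply(df, as.character)  # factors and integers become character
  names(df) <- col_names
  rownames(df) <- NULL
  df
}

# Illustrative stand-ins for the XML-style and JSON-style data frames
xml_like <- data.frame(
  Book_Name = c("The Complete Calvin and Hobbes", "Peanuts 2000"),
  Pages = c("1456", "176"),
  stringsAsFactors = TRUE
)
json_like <- data.frame(
  books.Book_Name = c("Calvin and Hobbes", "Peanuts 2000"),
  books.Pages = c(1456L, 176L),
  stringsAsFactors = FALSE
)

cols <- c("Book_Name", "Pages")
a <- normalise_books(xml_like, cols)
b <- normalise_books(json_like, cols)

identical(a, b)                # FALSE here: the first title differs
which(a != b, arr.ind = TRUE)  # locates the differing cell(s)
```

Converting everything to character first means the check reports differences in content rather than differences in storage type or header naming.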