Assignment 7: Working with XML and JSON in R

To familiarize myself with the three file formats (HTML, JSON, and XML), I created one file of each type, all containing the same information about a few of my textbooks.

library(RCurl)
## Loading required package: bitops
library(XML)
library(jsonlite)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)

Parsing HTML

# Download the raw HTML as a character string
html_URL <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.html")
# Parse the tables in the page, using the first row as column names
books_html <- readHTMLTable(html_URL, header = TRUE)
books_html_df <- data.frame(books_html$MSDS_Books)
kable(books_html_df)
| Title | Author | Author.1 | Author.2 | Author.3 | Year_Published | Publisher | Pages | ISBN.10 |
|---|---|---|---|---|---|---|---|---|
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert | Christian Rubba | Peter Meissner | Dominic Nyhuis | 2015 | Wiley | 474 | 111883481X |
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost | Tom Fawcett | | | 2013 | O’Reilly Media | 414 | 9781449361327 |
| OpenIntro Statistics | David M Diez | Christopher D Barr | Mine Cetinkaya-Rundel | | 2015 | OpenIntro, Inc. | 436 | 1943450056 |
books_html_df
##                                                                                           Title
## 1           Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3                                                                          OpenIntro Statistics
##           Author           Author.1              Author.2       Author.3
## 1  Simon Munzert    Christian Rubba        Peter Meissner Dominic Nyhuis
## 2 Foster Provost        Tom Fawcett                                     
## 3   David M Diez Christopher D Barr Mine Cetinkaya-Rundel               
##   Year_Published       Publisher Pages       ISBN.10
## 1           2015           Wiley   474    111883481X
## 2           2013  O'Reilly Media   414 9781449361327
## 3           2015 OpenIntro, Inc.   436    1943450056
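
The column names Author, Author.1, Author.2, Author.3 and ISBN.10 in the output above are what base R's name sanitizing would produce. The following is a minimal sketch of that behavior, assuming (the raw books.html is not shown here) that the table has four header cells all labeled Author and one labeled ISBN-10. The header vector and the one-row data frames are illustrative only.

```r
# Hypothetical header row: assumes books.html repeats "Author" four times
# and labels the last column "ISBN-10" (the raw file is not shown here).
raw_headers <- c("Title", "Author", "Author", "Author", "Author", "ISBN-10")

# make.names() replaces invalid characters such as "-" with "." and,
# with unique = TRUE, appends .1, .2, ... to repeated names.
clean_headers <- make.names(raw_headers, unique = TRUE)
print(clean_headers)

# data.frame() applies the same rule unless check.names = FALSE is given.
kept    <- data.frame(`ISBN-10` = "1943450056", check.names = FALSE)
changed <- data.frame(`ISBN-10` = "1943450056")
print(names(kept))
print(names(changed))
```

If the original header text matters, check.names = FALSE is the switch that keeps it, at the cost of needing backticks to refer to the column.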

Parsing JSON

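# Download the raw JSON as a character string
json_URL <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.json")
# Parse the JSON text; the book records sit under the "MSDS Books" key
books_json <- fromJSON(json_URL)
books_json_df <- data.frame(books_json$`MSDS Books`)
kable(books_json_df)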
| Title | Author.1 | Author.2 | Author.3 | Author.4 | Year_Published | Publisher | Pages | ISBN.10 |
|---|---|---|---|---|---|---|---|---|
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert | Christian Rubba | Peter Meissner | Dominic Nyhuis | 2015 | Wiley | 474 | 111883481X |
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost | Tom Fawcett | | | 2013 | O’Reilly Media | 414 | 9781449361327 |
| OpenIntro Statistics | David M Diez | Christopher D Barr | Mine Cetinkaya-Rundel | | 2015 | OpenIntro, Inc. | 436 | 1943450056 |
books_json_df
##                                                                                           Title
## 1           Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3                                                                          OpenIntro Statistics
##         Author.1           Author.2              Author.3       Author.4
## 1  Simon Munzert    Christian Rubba        Peter Meissner Dominic Nyhuis
## 2 Foster Provost        Tom Fawcett                                     
## 3   David M Diez Christopher D Barr Mine Cetinkaya-Rundel               
##   Year_Published       Publisher Pages       ISBN.10
## 1           2015           Wiley   474    111883481X
## 2           2013  O'Reilly Media   414 9781449361327
## 3           2015 OpenIntro, Inc.   436    1943450056
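
To see how fromJSON() assigns column types, it can help to parse a small inline document and inspect the classes. This is only a sketch: the JSON string below is a hypothetical two-record fragment modeled on the table above, not the actual contents of books.json, and sample_df and col_classes are names introduced for illustration.

```r
library(jsonlite)

# Hypothetical fragment modeled on the table above; not the real books.json.
json_text <- '[
  {"Title": "OpenIntro Statistics", "Year_Published": 2015,
   "Pages": 436, "ISBN-10": "1943450056"},
  {"Title": "Data Science for Business", "Year_Published": 2013,
   "Pages": 414, "ISBN-10": "9781449361327"}
]'

# An array of records is simplified to a data frame by default.
sample_df <- fromJSON(json_text)

# Unquoted JSON numbers become numeric columns; quoted values stay character.
col_classes <- sapply(sample_df, class)
print(col_classes)
```

Whether a column comes back as a number therefore depends on whether the value is quoted in the JSON file itself.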

Parsing XML

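# Download the raw XML as a character string
xml_url <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.xml")
# Parse the XML text and take the root node
books_xml <- xmlParse(xml_url)
books_xml_rt <- xmlRoot(books_xml)
# Convert each child element of the root into one data frame row
books_xml_df <- xmlToDataFrame(books_xml_rt)
kable(books_xml_df)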
| Title | Author.1 | Author.2 | Author.3 | Author.4 | Year_Published | Publisher | Pages | ISBN-10 |
|---|---|---|---|---|---|---|---|---|
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert | Christian Rubba | Peter Meissner | Dominic Nyhuis | 2015 | Wiley | 474 | 111883481X |
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost | Tom Fawcett | | | 2013 | O’Reilly Media | 414 | 9781449361327 |
| OpenIntro Statistics | David M Diez | Christopher D Barr | Mine Cetinkaya-Rundel | | 2015 | OpenIntro, Inc. | 436 | 9781449361327 |
books_xml_df
##                                                                                           Title
## 1           Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3                                                                          OpenIntro Statistics
##         Author.1           Author.2              Author.3       Author.4
## 1  Simon Munzert    Christian Rubba        Peter Meissner Dominic Nyhuis
## 2 Foster Provost        Tom Fawcett                                     
## 3   David M Diez Christopher D Barr Mine Cetinkaya-Rundel               
##   Year_Published       Publisher Pages       ISBN-10
## 1           2015           Wiley   474    111883481X
## 2           2013  O'Reilly Media   414 9781449361327
## 3           2015 OpenIntro, Inc.   436 9781449361327

Conclusion:

I found a couple of differences in how the data frame was created for each file type. For both the HTML and JSON files, the hyphen in the variable name ISBN-10 was replaced with a period (ISBN.10), which is consistent with data.frame() converting column names to syntactically valid R names by default; the XML data frame, which I did not pass through data.frame(), kept the hyphen. Apart from that small difference the data frames look mostly the same. I also found that the JSON data frame distinguished between integer and character data types, whereas the HTML and XML data frames stored every column as a factor. That result may, however, be attributable to the way I formatted the raw data files.
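
If the all-factor columns from the HTML and XML parsers get in the way, one possible remedy is to convert them after parsing. The sketch below uses a hypothetical stand-in data frame (books_factor_df, built by hand with shortened titles rather than parsed from the files) and base R's type.convert(); it illustrates the idea and is not a step from the analysis above.

```r
# Hypothetical stand-in for a parsed data frame in which every column is a
# factor; values are taken from the tables above, titles shortened.
books_factor_df <- data.frame(
  Title          = c("Automated Data Collection with R", "OpenIntro Statistics"),
  Year_Published = c("2015", "2015"),
  Pages          = c("474", "436"),
  ISBN.10        = c("111883481X", "1943450056"),
  stringsAsFactors = TRUE
)

# Convert factors to plain strings first, then let type.convert() choose a
# type per column; as.is = TRUE keeps text columns as character, not factor.
books_chr_df <- data.frame(lapply(books_factor_df, as.character),
                           stringsAsFactors = FALSE)
books_typed_df <- type.convert(books_chr_df, as.is = TRUE)

print(sapply(books_factor_df, class))
print(sapply(books_typed_df, class))
```

One caveat with this approach: a column of ISBNs made up only of digits would be converted to a number, so identifier columns may be better left out of the conversion.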