Week 7 Assignment

Assignment 7: Working with XML and JSON in R

To familiarize myself with the three file formats (HTML, JSON, and XML), I created one file of each type, all containing the same information about a few of my textbooks.

library(RCurl)

## Loading required package: bitops

library(XML)
library(jsonlite)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)

Parsing HTML

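# Download the raw HTML as a single character string
html_URL <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.html")
# Parse the tables in the page, treating the first row as column headers
books_html <- readHTMLTable(html_URL, header = TRUE)
# Pull out the MSDS_Books table as a data frame
books_html_df <- data.frame(books_html$MSDS_Books)
kable(books_html_df)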

Title	Author	Author.1	Author.2	Author.3	Year_Published	Publisher	Pages	ISBN.10
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining	Simon Munzert	Christian Rubba	Peter Meissner	Dominic Nyhuis	2015	Wiley	474	111883481X
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	Foster Provost	Tom Fawcett			2013	O’Reilly Media	414	9781449361327
OpenIntro Statistics	David M Diez	Christopher D Barr	Mine Cetinkaya-Rundel		2015	OpenIntro, Inc.	436	1943450056

books_html_df

##                                                                                           Title
## 1           Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3                                                                          OpenIntro Statistics
##           Author           Author.1              Author.2       Author.3
## 1  Simon Munzert    Christian Rubba        Peter Meissner Dominic Nyhuis
## 2 Foster Provost        Tom Fawcett                                     
## 3   David M Diez Christopher D Barr Mine Cetinkaya-Rundel               
##   Year_Published       Publisher Pages       ISBN.10
## 1           2015           Wiley   474    111883481X
## 2           2013  O'Reilly Media   414 9781449361327
## 3           2015 OpenIntro, Inc.   436    1943450056

Parsing JSON

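# Download the raw JSON as a single character string
json_URL <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.json")
# Parse the JSON text into R objects
books_json <- fromJSON(json_URL)
# Pull out the "MSDS Books" element as a data frame
books_json_df <- data.frame(books_json$`MSDS Books`)
kable(books_json_df)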

Title	Author.1	Author.2	Author.3	Author.4	Year_Published	Publisher	Pages	ISBN.10
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining	Simon Munzert	Christian Rubba	Peter Meissner	Dominic Nyhuis	2015	Wiley	474	111883481X
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	Foster Provost	Tom Fawcett			2013	O’Reilly Media	414	9781449361327
OpenIntro Statistics	David M Diez	Christopher D Barr	Mine Cetinkaya-Rundel		2015	OpenIntro, Inc.	436	1943450056

books_json_df

##                                                                                           Title
## 1           Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3                                                                          OpenIntro Statistics
##         Author.1           Author.2              Author.3       Author.4
## 1  Simon Munzert    Christian Rubba        Peter Meissner Dominic Nyhuis
## 2 Foster Provost        Tom Fawcett                                     
## 3   David M Diez Christopher D Barr Mine Cetinkaya-Rundel               
##   Year_Published       Publisher Pages       ISBN.10
## 1           2015           Wiley   474    111883481X
## 2           2013  O'Reilly Media   414 9781449361327
## 3           2015 OpenIntro, Inc.   436    1943450056
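The `ISBN.10` header in the output above is most likely a side effect of wrapping the parsed table in `data.frame()`, which by default (`check.names = TRUE`) runs column names through `make.names()` and replaces characters that are not valid in R names with a period. The sketch below illustrates that behavior using base R only; the `books` data frame is a hypothetical stand-in for the parsed table, not the real file contents.

```r
# Hypothetical stand-in for a parsed table whose column name contains a hyphen
books <- data.frame(
  Title = "OpenIntro Statistics",
  `ISBN-10` = "1943450056",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Wrapping in data.frame() with the default check.names = TRUE rewrites the name
mangled <- data.frame(books)
names(mangled)

# Passing check.names = FALSE keeps the original name
kept <- data.frame(books, check.names = FALSE)
names(kept)

# make.names() is the function that performs the substitution
make.names("ISBN-10")
```

If the original header matters, passing `check.names = FALSE` when building the data frame is one way to keep it, at the cost of needing backticks to refer to the column (for example ``kept$`ISBN-10` ``).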

Parsing XML

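# Download the raw XML as a single character string
xml_url <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.xml")
# Parse the XML text into a document tree
books_xml <- xmlParse(xml_url)
# Get the root node, whose children are the individual book records
books_xml_rt <- xmlRoot(books_xml)
# Convert the records into a data frame
books_xml_df <- xmlToDataFrame(books_xml_rt)
kable(books_xml_df)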

Title	Author.1	Author.2	Author.3	Author.4	Year_Published	Publisher	Pages	ISBN-10
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining	Simon Munzert	Christian Rubba	Peter Meissner	Dominic Nyhuis	2015	Wiley	474	111883481X
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	Foster Provost	Tom Fawcett			2013	O’Reilly Media	414	9781449361327
OpenIntro Statistics	David M Diez	Christopher D Barr	Mine Cetinkaya-Rundel		2015	OpenIntro, Inc.	436	9781449361327

books_xml_df

##                                                                                           Title
## 1           Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3                                                                          OpenIntro Statistics
##         Author.1           Author.2              Author.3       Author.4
## 1  Simon Munzert    Christian Rubba        Peter Meissner Dominic Nyhuis
## 2 Foster Provost        Tom Fawcett                                     
## 3   David M Diez Christopher D Barr Mine Cetinkaya-Rundel               
##   Year_Published       Publisher Pages       ISBN-10
## 1           2015           Wiley   474    111883481X
## 2           2013  O'Reilly Media   414 9781449361327
## 3           2015 OpenIntro, Inc.   436 9781449361327

Conclusion:

I found a couple of differences in how the data frame was created for each file type. For both the HTML and JSON files, the hyphen in the variable name ISBN-10 was replaced with a period (ISBN.10), while the XML data frame kept the hyphen. Apart from that small difference, the data frames look mostly the same. I also found that the JSON data frame distinguished between integer and character data types, whereas the HTML and XML data frames stored every column as a factor. That result may, however, be attributable to the way I formatted the raw data files.
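The type difference noted above can be checked and repaired after import. The following is a minimal sketch using base R only: `raw` is a hypothetical all-factor data frame that imitates what the HTML and XML imports produced (the titles are placeholders), and `type.convert()` is used to infer column types from the text values.

```r
# Hypothetical all-factor data frame imitating the HTML/XML import result
raw <- data.frame(
  Title = c("Book A", "Book B"),
  Year_Published = c("2015", "2013"),
  Pages = c("474", "414"),
  ISBN = c("111883481X", "9781449361327"),
  stringsAsFactors = TRUE
)

# Every column starts out as a factor
sapply(raw, class)

# Convert factors to their character labels first, then let
# type.convert() infer a type for each column; as.is = TRUE
# keeps text columns as character instead of re-creating factors
typed <- raw
typed[] <- lapply(typed, as.character)
typed <- type.convert(typed, as.is = TRUE)

# Year_Published and Pages become integer; Title and ISBN stay character
sapply(typed, class)
```

Converting to character before any numeric conversion matters because calling `as.integer()` directly on a factor returns the internal level codes rather than the printed values.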