Assignment 7: Working with XML and JSON in R
To familiarize myself with three file formats (HTML, XML, and JSON), I created one file of each type, all containing the same information about a few of my textbooks.
library(RCurl)
## Loading required package: bitops
library(XML)
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
Parsing HTML
html_URL <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.html")
books_html <- readHTMLTable(html_URL, header = TRUE)
books_html_df <- data.frame(books_html$MSDS_Books)
kable(books_html_df)
| Title | Author | Author.1 | Author.2 | Author.3 | Year_Published | Publisher | Pages | ISBN.10 |
|---|---|---|---|---|---|---|---|---|
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert | Christian Rubba | Peter Meissner | Dominic Nyhuis | 2015 | Wiley | 474 | 111883481X |
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost | Tom Fawcett | | | 2013 | O’Reilly Media | 414 | 9781449361327 |
| OpenIntro Statistics | David M Diez | Christopher D Barr | Mine Cetinkaya-Rundel | | 2015 | OpenIntro, Inc. | 436 | 1943450056 |
books_html_df
## Title
## 1 Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3 OpenIntro Statistics
## Author Author.1 Author.2 Author.3
## 1 Simon Munzert Christian Rubba Peter Meissner Dominic Nyhuis
## 2 Foster Provost Tom Fawcett
## 3 David M Diez Christopher D Barr Mine Cetinkaya-Rundel
## Year_Published Publisher Pages ISBN.10
## 1 2015 Wiley 474 111883481X
## 2 2013 O'Reilly Media 414 9781449361327
## 3 2015 OpenIntro, Inc. 436 1943450056
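The table read from HTML arrives as text, so a numeric column such as Pages needs an explicit conversion before it can be used as a number. The following is a minimal sketch, not the code used above: `sample_html` is a hypothetical two-row stand-in for books.html, and `which = 1` simply selects the first table in the document.

```r
library(XML)

# Hypothetical stand-in for books.html (two rows, three columns).
sample_html <- '<html><body>
<table id="MSDS_Books">
  <tr><th>Publisher</th><th>Pages</th><th>ISBN-10</th></tr>
  <tr><td>Wiley</td><td>474</td><td>111883481X</td></tr>
  <tr><td>OpenIntro, Inc.</td><td>436</td><td>1943450056</td></tr>
</table>
</body></html>'

# Parse the string as HTML text rather than as a file name.
doc <- htmlParse(sample_html, asText = TRUE)

# Read the first table; keep strings as character instead of factor.
books <- readHTMLTable(doc, header = TRUE, which = 1,
                       stringsAsFactors = FALSE)

# Convert only the column known to be numeric.
books$Pages <- as.integer(books$Pages)

str(books)
```

Converting Pages by name, rather than guessing types for every column, leaves the ISBN column as text, which matters because an ISBN-10 can end in the letter X.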
Parsing JSON
json_URL <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.json")
books_json <- fromJSON(json_URL)
books_json_df <- data.frame(books_json$`MSDS Books`)
kable(books_json_df)
| Title | Author.1 | Author.2 | Author.3 | Author.4 | Year_Published | Publisher | Pages | ISBN.10 |
|---|---|---|---|---|---|---|---|---|
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert | Christian Rubba | Peter Meissner | Dominic Nyhuis | 2015 | Wiley | 474 | 111883481X |
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost | Tom Fawcett | | | 2013 | O’Reilly Media | 414 | 9781449361327 |
| OpenIntro Statistics | David M Diez | Christopher D Barr | Mine Cetinkaya-Rundel | | 2015 | OpenIntro, Inc. | 436 | 1943450056 |
books_json_df
## Title
## 1 Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3 OpenIntro Statistics
## Author.1 Author.2 Author.3 Author.4
## 1 Simon Munzert Christian Rubba Peter Meissner Dominic Nyhuis
## 2 Foster Provost Tom Fawcett
## 3 David M Diez Christopher D Barr Mine Cetinkaya-Rundel
## Year_Published Publisher Pages ISBN.10
## 1 2015 Wiley 474 111883481X
## 2 2013 O'Reilly Media 414 9781449361327
## 3 2015 OpenIntro, Inc. 436 1943450056
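JSON distinguishes numbers from strings in the file itself, and `fromJSON()` carries that distinction into R. The sketch below illustrates this with a hypothetical inline string, `sample_json`, that reuses the "MSDS Books" key; which values were quoted in the real books.json is an assumption here.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: Pages is an unquoted number,
# the ISBN is a quoted string.
sample_json <- '{"MSDS Books": [
  {"Publisher": "Wiley", "Pages": 474, "ISBN-10": "111883481X"},
  {"Publisher": "OpenIntro, Inc.", "Pages": 436, "ISBN-10": "1943450056"}
]}'

parsed <- fromJSON(sample_json)

# An array of objects is simplified to a data frame by default.
books <- parsed[["MSDS Books"]]

# Inspect the class that each column received.
sapply(books, class)
```

Quoting a value in the JSON file is therefore what decides whether it becomes a character or a numeric column.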
Parsing XML
xml_url <- getURL("https://raw.githubusercontent.com/rlauto/DATA-607-Assignments/master/Week7Assignment/books.xml")
books_xml <- xmlParse(xml_url)
books_xml_rt <- xmlRoot(books_xml)
books_xml_df <- xmlToDataFrame(books_xml_rt)
kable(books_xml_df)
| Title | Author.1 | Author.2 | Author.3 | Author.4 | Year_Published | Publisher | Pages | ISBN-10 |
|---|---|---|---|---|---|---|---|---|
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert | Christian Rubba | Peter Meissner | Dominic Nyhuis | 2015 | Wiley | 474 | 111883481X |
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost | Tom Fawcett | | | 2013 | O’Reilly Media | 414 | 9781449361327 |
| OpenIntro Statistics | David M Diez | Christopher D Barr | Mine Cetinkaya-Rundel | | 2015 | OpenIntro, Inc. | 436 | 9781449361327 |
books_xml_df
## Title
## 1 Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining
## 2 Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
## 3 OpenIntro Statistics
## Author.1 Author.2 Author.3 Author.4
## 1 Simon Munzert Christian Rubba Peter Meissner Dominic Nyhuis
## 2 Foster Provost Tom Fawcett
## 3 David M Diez Christopher D Barr Mine Cetinkaya-Rundel
## Year_Published Publisher Pages ISBN-10
## 1 2015 Wiley 474 111883481X
## 2 2013 O'Reilly Media 414 9781449361327
## 3 2015 OpenIntro, Inc. 436 9781449361327
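Because the XML data frame keeps the hyphen in ISBN-10, that column has to be accessed with backticks or double brackets. A minimal sketch, assuming a hypothetical two-record document (`sample_xml`; the element names `MSDS_Books` and `Book` are illustrative and may differ from the real books.xml):

```r
library(XML)

# Hypothetical stand-in for books.xml.
sample_xml <- '<MSDS_Books>
  <Book>
    <Publisher>Wiley</Publisher>
    <Pages>474</Pages>
    <ISBN-10>111883481X</ISBN-10>
  </Book>
  <Book>
    <Publisher>OpenIntro, Inc.</Publisher>
    <Pages>436</Pages>
    <ISBN-10>1943450056</ISBN-10>
  </Book>
</MSDS_Books>'

# Parse the string as XML text and take the root node, as above.
doc <- xmlParse(sample_xml, asText = TRUE)
root <- xmlRoot(doc)

# One row per child of the root; keep strings as character.
books <- xmlToDataFrame(root, stringsAsFactors = FALSE)

# XML has no numeric type, so convert explicitly.
books$Pages <- as.integer(books$Pages)

# A non-syntactic name needs backticks (or [["ISBN-10"]]).
books$`ISBN-10`
```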
Conclusion
I found a couple of differences in how the data frame was created for each file type. For the HTML and JSON files, the hyphen in the variable name ISBN-10 was replaced with a period (ISBN.10), while the XML data frame kept the hyphen. This is most likely because the HTML and JSON results were passed through data.frame(), which by default rewrites column names into syntactically valid R names, whereas the output of xmlToDataFrame() was used as returned. The HTML author columns were also named Author, Author.1, Author.2, and Author.3 rather than Author.1 through Author.4. Apart from those naming differences the data frames look much the same, but the JSON data frame distinguished integer from character columns, whereas the HTML and XML data frames stored every column as a factor. That result may, however, be attributable to the way I formatted the raw data files.
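The renaming of ISBN-10 can be reproduced in base R alone. The sketch below uses illustrative column names and one made-up record structure (`raw_names` and `book` are not taken from the data files) to show how `data.frame()` behaves with and without its default name checking.

```r
# Column names as they might appear in a raw file (illustrative).
raw_names <- c("Title", "Author", "Author", "Year_Published", "ISBN-10")

# make.names() is what data.frame() applies when check.names = TRUE:
# duplicates get numeric suffixes and invalid characters become periods.
fixed_names <- make.names(raw_names, unique = TRUE)
fixed_names

# The same one-row record built with and without name checking.
book <- list("OpenIntro Statistics", 436L, "1943450056")
names(book) <- c("Title", "Pages", "ISBN-10")

checked   <- data.frame(book)                       # default behaviour
unchecked <- data.frame(book, check.names = FALSE)  # names kept as-is

names(checked)
names(unchecked)
```

Passing `check.names = FALSE` is one way to keep the original column names identical across all three formats.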