DATA 607 - Week 7 Assignment

Introduction
HTML file
XML file
JSON file
Conclusions

Introduction

For this assignment, I chose three books from our assigned or suggested reading, and created HTML, XML, and JSON files containing the following fields:

Title
Author (two of the books have two authors each)
Publisher
Year of publication
Cost in USD
Subtitle

The three files are saved on GitHub.

To start, we load several libraries that we will use below.

# load libraries
library(tidyverse)
library(knitr)
library(RCurl)
library(XML)
library(jsonlite)

HTML file

First, let’s load the HTML file and create a data frame using the readHTMLTable function.

# read and parse html file
url_h <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.html"
# htmlParse cannot fetch https:// URLs itself (libxml2 has no HTTPS support), so download the text with getURL first
raw_h <- htmlParse(getURL(url_h))
class(raw_h)

## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"

# read html table
raw_h1 <- readHTMLTable(raw_h, stringsAsFactors = FALSE)
class(raw_h1)

## [1] "list"

# this returned a list, so extract first element as the data frame
df_h <- raw_h1[[1]]
str(df_h)

## 'data.frame':    3 obs. of  6 variables:
##  $ Title       : chr  "Data Science for Business" "R for Data Science" "R for Everyone"
##  $ Author      : chr  "Foster Provost & Tom Fawcett" "Hadley Wickham & Garrett Grolemund" "Jared P. Lander"
##  $ Publisher   : chr  "O'Reilly" "O'Reilly" "Addison Wesley"
##  $ Publish Year: chr  "2013" "2017" "2017"
##  $ Cost USD    : chr  "39.99" "39.99" "44.99"
##  $ Subtitle    : chr  "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"

kable(df_h)

Title	Author	Publisher	Publish Year	Cost USD	Subtitle
Data Science for Business	Foster Provost & Tom Fawcett	O’Reilly	2013	39.99	What You Need to Know About Data Mining and Data-Analytic Thinking
R for Data Science	Hadley Wickham & Garrett Grolemund	O’Reilly	2017	39.99	Import, Tidy, Transform, Visualize, and Model Data
R for Everyone	Jared P. Lander	Addison Wesley	2017	44.99	Advanced Analytics and Graphics
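As the str output above shows, readHTMLTable returns a list with one element per table and loads every column as character. The following is a minimal sketch of that behavior and of one way to convert the year and cost columns to numeric types. The inline HTML markup is an assumption made for illustration (the actual books.html is not reproduced here), and only three of the six columns are included.

```r
library(XML)

# Hypothetical markup: a single table with a header row, similar in shape to books.html
html_txt <- '<html><body>
<table>
  <tr><th>Title</th><th>Publish Year</th><th>Cost USD</th></tr>
  <tr><td>Data Science for Business</td><td>2013</td><td>39.99</td></tr>
  <tr><td>R for Everyone</td><td>2017</td><td>44.99</td></tr>
</table>
</body></html>'

# asText = TRUE tells the parser the argument is content, not a file name or URL
doc <- htmlParse(html_txt, asText = TRUE)

# one list element per <table> found in the document
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
df <- tables[[1]]

# every column arrives as character, so convert the numeric ones explicitly
df$`Publish Year` <- as.integer(df$`Publish Year`)
df$`Cost USD` <- as.numeric(df$`Cost USD`)
str(df)
```

Converting the types at this point would make the HTML data frame directly comparable to one where the year and cost are already numeric.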

Notice that for the two books with two authors, the two authors show up in the same “Author” field, separated by an “&”. To separate the authors, we use str_split and mutate to create the final data frame.

# separate the two authors
temp <- str_split(df_h$Author, "&", simplify = TRUE)
temp

##      [,1]              [,2]                
## [1,] "Foster Provost " " Tom Fawcett"      
## [2,] "Hadley Wickham " " Garrett Grolemund"
## [3,] "Jared P. Lander" ""

# final data frame (trim the whitespace that the split leaves around each name)
df_h1 <- df_h %>%
    mutate(Author1 = str_trim(temp[ , 1]), Author2 = str_trim(temp[ , 2])) %>%
    select(Title, Author1, Author2, Publisher, `Publish Year`, `Cost USD`, Subtitle)
kable(df_h1)

Title	Author1	Author2	Publisher	Publish Year	Cost USD	Subtitle
Data Science for Business	Foster Provost	Tom Fawcett	O’Reilly	2013	39.99	What You Need to Know About Data Mining and Data-Analytic Thinking
R for Data Science	Hadley Wickham	Garrett Grolemund	O’Reilly	2017	39.99	Import, Tidy, Transform, Visualize, and Model Data
R for Everyone	Jared P. Lander		Addison Wesley	2017	44.99	Advanced Analytics and Graphics
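The split-then-mutate approach above could also be written as a single step with tidyr::separate. This is an alternative sketch, not the method used in this assignment. The small data frame is rebuilt inline from the values shown above so that the example runs on its own.

```r
library(tidyr)

# Stand-in for df_h, reduced to the two columns that matter here
df <- data.frame(
  Title  = c("Data Science for Business", "R for Data Science", "R for Everyone"),
  Author = c("Foster Provost & Tom Fawcett",
             "Hadley Wickham & Garrett Grolemund",
             "Jared P. Lander"),
  stringsAsFactors = FALSE
)

# The regular expression swallows the spaces around "&", so no trimming is needed.
# fill = "right" pads a missing second author with NA instead of raising a warning.
df_split <- separate(df, Author, into = c("Author1", "Author2"),
                     sep = "\\s*&\\s*", fill = "right")
df_split
```

One difference from the str_split version is that a book with a single author gets NA rather than an empty string in Author2.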

XML file

Next, let’s load the XML file and create a data frame using the xmlToDataFrame function.

# read and parse xml file
url_x <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.xml"
# as with htmlParse, xmlParse cannot fetch https:// URLs itself, so download the text with getURL first
raw_x <- xmlParse(getURL(url_x))
class(raw_x)

## [1] "XMLInternalDocument" "XMLAbstractDocument"

# transform xml object into data frame
df_x <- xmlToDataFrame(raw_x, stringsAsFactors = FALSE)
str(df_x)

## 'data.frame':    3 obs. of  6 variables:
##  $ title       : chr  "Data Science for Business" "R for Data Science" "R for Everyone"
##  $ author      : chr  "Foster Provost & Tom Fawcett" "Hadley Wickham & Garrett Grolemund" "Jared P. Lander"
##  $ publisher   : chr  "O'Reilly" "O'Reilly" "Addison Wesley"
##  $ publish_year: chr  "2013" "2017" "2017"
##  $ cost_USD    : chr  "39.99" "39.99" "44.99"
##  $ subtitle    : chr  "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"

kable(df_x)

title	author	publisher	publish_year	cost_USD	subtitle
Data Science for Business	Foster Provost & Tom Fawcett	O’Reilly	2013	39.99	What You Need to Know About Data Mining and Data-Analytic Thinking
R for Data Science	Hadley Wickham & Garrett Grolemund	O’Reilly	2017	39.99	Import, Tidy, Transform, Visualize, and Model Data
R for Everyone	Jared P. Lander	Addison Wesley	2017	44.99	Advanced Analytics and Graphics

As before, we use str_split to separate the two authors into separate fields, and then mutate to create the final data frame.

# separate the two authors (trim the whitespace that the split leaves around each name)
temp <- str_split(df_x$author, "&", simplify = TRUE)
df_x1 <- df_x %>%
    mutate(author1 = str_trim(temp[ , 1]), author2 = str_trim(temp[ , 2])) %>%
    select(title, author1, author2, publisher, publish_year, cost_USD, subtitle)
# capitalize column headings
colnames(df_x1) <- str_to_title(colnames(df_x1))
kable(df_x1)

Title	Author1	Author2	Publisher	Publish_year	Cost_usd	Subtitle
Data Science for Business	Foster Provost	Tom Fawcett	O’Reilly	2013	39.99	What You Need to Know About Data Mining and Data-Analytic Thinking
R for Data Science	Hadley Wickham	Garrett Grolemund	O’Reilly	2017	39.99	Import, Tidy, Transform, Visualize, and Model Data
R for Everyone	Jared P. Lander		Addison Wesley	2017	44.99	Advanced Analytics and Graphics
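To illustrate how xmlToDataFrame arrived at the rows and columns above, here is a minimal sketch using an inline XML string. The element names books and book are assumptions (the actual books.xml is not reproduced here); the field names match the column names in the data frame. Each child of the root becomes a row, and each of its children becomes a column.

```r
library(XML)

# Hypothetical layout; note that a literal "&" must be written as &amp; in XML
xml_txt <- '<books>
  <book>
    <title>Data Science for Business</title>
    <author>Foster Provost &amp; Tom Fawcett</author>
    <publish_year>2013</publish_year>
  </book>
  <book>
    <title>R for Everyone</title>
    <author>Jared P. Lander</author>
    <publish_year>2017</publish_year>
  </book>
</books>'

doc <- xmlParse(xml_txt, asText = TRUE)

# one row per <book>, one character column per child element
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
str(df)
```

As with the HTML table, all values load as character, so publish_year would need an explicit conversion before being used as a number.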

JSON file

Finally, we load the JSON file and create a data frame using the fromJSON function.

# read json file
url_j <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.json"
raw_j <- fromJSON(url_j)
class(raw_j)

## [1] "list"

# this returned a list, so extract first element as the data frame
df_j <- raw_j[[1]]
str(df_j)

## 'data.frame':    3 obs. of  6 variables:
##  $ title       : chr  "Data Science for Business" "R for Data Science" "R for Everyone"
##  $ author      :List of 3
##   ..$ : chr  "Foster Provost" "Tom Fawcett"
##   ..$ : chr  "Hadley Wickham" "Garrett Grolemund"
##   ..$ : chr "Jared P. Lander"
##  $ publisher   : chr  "O'Reilly" "O'Reilly" "Addison Wesley"
##  $ publish_year: int  2013 2017 2017
##  $ cost_USD    : num  40 40 45
##  $ subtitle    : chr  "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"

kable(df_j)

title	author	publisher	publish_year	cost_USD	subtitle
Data Science for Business	c(“Foster Provost”, “Tom Fawcett”)	O’Reilly	2013	39.99	What You Need to Know About Data Mining and Data-Analytic Thinking
R for Data Science	c(“Hadley Wickham”, “Garrett Grolemund”)	O’Reilly	2017	39.99	Import, Tidy, Transform, Visualize, and Model Data
R for Everyone	Jared P. Lander	Addison Wesley	2017	44.99	Advanced Analytics and Graphics

Notice that for the two books with two authors, the two authors now show up as a character vector in the “author” field. This is because in the original JSON file the authors are stored as an array under the author key, so fromJSON loads “author” as a list column whose elements are character vectors of varying length. We can separate the two authors by extracting the first and second element of each vector, and then using mutate to create the final data frame.

# separate the two authors
str(df_j$author)

## List of 3
##  $ : chr [1:2] "Foster Provost" "Tom Fawcett"
##  $ : chr [1:2] "Hadley Wickham" "Garrett Grolemund"
##  $ : chr "Jared P. Lander"

# this is a list of character vectors, so extract the first and second element of each
# (indexing past the end of a vector returns NA, which marks a missing second author)
author1 <- vapply(df_j$author, `[`, character(1), 1)
author2 <- vapply(df_j$author, `[`, character(1), 2)
df_j1 <- df_j %>%
    mutate(author1, author2) %>%
    select(title, author1, author2, publisher, publish_year, cost_USD, subtitle)
# capitalize column headings
colnames(df_j1) <- str_to_title(colnames(df_j1))
kable(df_j1)

Title	Author1	Author2	Publisher	Publish_year	Cost_usd	Subtitle
Data Science for Business	Foster Provost	Tom Fawcett	O’Reilly	2013	39.99	What You Need to Know About Data Mining and Data-Analytic Thinking
R for Data Science	Hadley Wickham	Garrett Grolemund	O’Reilly	2017	39.99	Import, Tidy, Transform, Visualize, and Model Data
R for Everyone	Jared P. Lander	NA	Addison Wesley	2017	44.99	Advanced Analytics and Graphics
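The list column seen in this section follows from how fromJSON simplifies its input. The sketch below uses an inline JSON string to show the effect. The top-level key books is an assumption (the actual books.json is not reproduced here), and only four of the six fields are included.

```r
library(jsonlite)

# Hypothetical layout: one top-level key holding an array of book objects
json_txt <- '{"books": [
  {"title": "R for Data Science",
   "author": ["Hadley Wickham", "Garrett Grolemund"],
   "publish_year": 2017, "cost_USD": 39.99},
  {"title": "R for Everyone",
   "author": ["Jared P. Lander"],
   "publish_year": 2017, "cost_USD": 44.99}
]}'

raw <- fromJSON(json_txt)

# the array of objects is simplified to a data frame stored under the top-level key
df <- raw[[1]]

# arrays of unequal length cannot form a regular column, so author stays a list
class(df$author)
lengths(df$author)

# JSON numbers keep their type, unlike the HTML and XML values
str(df[c("publish_year", "cost_USD")])
```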

Conclusions

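The three final data frames are very similar but not identical. The JSON data frame holds publish_year and cost_USD as numeric columns and uses NA for a missing second author, whereas the HTML and XML data frames hold every column as character and use an empty string. More importantly, the data frames differ in how a book with two authors is handled in the “author” field: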

In the HTML and XML files, the two authors were entered as a single string (separated by an “&”) in the “author” field. As a result, when the HTML and XML files were loaded, the two authors had to be separated by splitting the character string.
Alternatively, in the HTML table, the two authors could have been saved in separate columns, which would have avoided the splitting step for the HTML data frame. Similarly, in the XML file, the two authors could have been saved as separate elements or attributes rather than as one “author” string, which would have required a different step to extract the author data into the data frame.
In contrast, in the JSON file, the two authors were entered as an array in the “author” field. This caused the author data to load into the data frame as a list column of character vectors, which required an extra step to split into separate author fields.
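As a rough illustration of the XML alternative mentioned above, the sketch below stores each author in its own author element and extracts them with XPath. The layout and the element names books and book are hypothetical; this is not how the books.xml file in this assignment is structured.

```r
library(XML)

# Hypothetical layout: one <author> element per author
xml_txt <- '<books>
  <book>
    <title>R for Data Science</title>
    <author>Hadley Wickham</author>
    <author>Garrett Grolemund</author>
  </book>
  <book>
    <title>R for Everyone</title>
    <author>Jared P. Lander</author>
  </book>
</books>'

doc <- xmlParse(xml_txt, asText = TRUE)
books <- getNodeSet(doc, "//book")

# collect the author names of each book as a character vector
authors <- lapply(books, function(b) xpathSApply(b, "./author", xmlValue))

df <- data.frame(
  title   = vapply(books, function(b) xmlValue(b[["title"]]), character(1)),
  author1 = vapply(authors, `[`, character(1), 1),
  author2 = vapply(authors, `[`, character(1), 2),  # NA when there is no second author
  stringsAsFactors = FALSE
)
df
```

With this layout the extraction step resembles the JSON case: each book yields a vector of authors, and the first and second elements are pulled out by position.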