Introduction

For this assignment, I chose three books from our assigned or suggested reading and created HTML, XML, and JSON files containing the following fields: title, author(s), publisher, publication year, cost in USD, and subtitle.

The three files are saved on GitHub.

To start, we load several libraries that we will use below.

# load libraries
library(tidyverse)
library(knitr)
library(RCurl)
library(XML)
library(jsonlite)

HTML file

First, let’s load the HTML file and create a data frame using the readHTMLTable function.

# read and parse html file
url_h <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.html"
# htmlParse can't fetch an https URL directly (its libxml2 backend has no HTTPS support), so download the text with getURL first
raw_h <- htmlParse(getURL(url_h))
class(raw_h)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"
# read html table
raw_h1 <- readHTMLTable(raw_h, stringsAsFactors = FALSE)
class(raw_h1)
## [1] "list"
# this returned a list, so extract first element as the data frame
df_h <- raw_h1[[1]]
str(df_h)
## 'data.frame':    3 obs. of  6 variables:
##  $ Title       : chr  "Data Science for Business" "R for Data Science" "R for Everyone"
##  $ Author      : chr  "Foster Provost & Tom Fawcett" "Hadley Wickham & Garrett Grolemund" "Jared P. Lander"
##  $ Publisher   : chr  "O'Reilly" "O'Reilly" "Addison Wesley"
##  $ Publish Year: chr  "2013" "2017" "2017"
##  $ Cost USD    : chr  "39.99" "39.99" "44.99"
##  $ Subtitle    : chr  "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"
kable(df_h)
| Title | Author | Publisher | Publish Year | Cost USD | Subtitle |
|:------|:-------|:----------|:-------------|:---------|:---------|
| Data Science for Business | Foster Provost & Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham & Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |

Notice that for the two books with two authors, both names appear in a single “Author” field, separated by an “&”. To give each author a separate column, we split the field with str_split and then use mutate to build the final data frame.

# separate the two authors
temp <- str_split(df_h$Author, "&", simplify = TRUE)
temp
##      [,1]              [,2]                
## [1,] "Foster Provost " " Tom Fawcett"      
## [2,] "Hadley Wickham " " Garrett Grolemund"
## [3,] "Jared P. Lander" ""
# final data frame; str_trim removes the spaces that the split leaves around "&"
df_h1 <- df_h %>%
    mutate(Author1 = str_trim(temp[ , 1]), Author2 = str_trim(temp[ , 2])) %>%
    select(Title, Author1, Author2, Publisher, `Publish Year`, `Cost USD`, Subtitle)
kable(df_h1)
| Title | Author1 | Author2 | Publisher | Publish Year | Cost USD | Subtitle |
|:------|:--------|:--------|:----------|:-------------|:---------|:---------|
| Data Science for Business | Foster Provost | Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham | Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
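
As an alternative to str_split plus mutate, the same split can probably be done in one step with tidyr's separate function. The sketch below is only an illustration: the `books` data frame is a hypothetical stand-in for df_h with two of its columns, and the regular expression assumes the authors are always joined by a single "&". Note that with fill = "right" a missing second author becomes NA rather than an empty string.

```r
library(tidyr)

# Hypothetical stand-in for df_h, reduced to two columns for illustration
books <- data.frame(
    Title  = c("Data Science for Business", "R for Everyone"),
    Author = c("Foster Provost & Tom Fawcett", "Jared P. Lander"),
    stringsAsFactors = FALSE
)

# Split on "&" together with any surrounding whitespace;
# fill = "right" pads single-author rows with NA in Author2
books_split <- separate(
    books, Author,
    into = c("Author1", "Author2"),
    sep  = "\\s*&\\s*",
    fill = "right"
)
books_split
```

Because the separator pattern consumes the whitespace around the "&", no separate trimming step is needed here.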

XML file

Next, let’s load the XML file and create a data frame using the xmlToDataFrame function.

# read and parse xml file
url_x <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.xml"
# as above, xmlParse can't fetch an https URL directly, so download the text with getURL first
raw_x <- xmlParse(getURL(url_x))
class(raw_x)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
# transform xml object into data frame
df_x <- xmlToDataFrame(raw_x, stringsAsFactors = FALSE)
str(df_x)
## 'data.frame':    3 obs. of  6 variables:
##  $ title       : chr  "Data Science for Business" "R for Data Science" "R for Everyone"
##  $ author      : chr  "Foster Provost & Tom Fawcett" "Hadley Wickham & Garrett Grolemund" "Jared P. Lander"
##  $ publisher   : chr  "O'Reilly" "O'Reilly" "Addison Wesley"
##  $ publish_year: chr  "2013" "2017" "2017"
##  $ cost_USD    : chr  "39.99" "39.99" "44.99"
##  $ subtitle    : chr  "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"
kable(df_x)
| title | author | publisher | publish_year | cost_USD | subtitle |
|:------|:-------|:----------|:-------------|:---------|:---------|
| Data Science for Business | Foster Provost & Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham & Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |

As before, we can use str_split to split the two authors into separate fields, and then mutate to create the final data frame.

# separate the two authors; str_trim removes the spaces that the split leaves around "&"
temp <- str_split(df_x$author, "&", simplify = TRUE)
df_x1 <- df_x %>%
    mutate(author1 = str_trim(temp[ , 1]), author2 = str_trim(temp[ , 2])) %>%
    select(title, author1, author2, publisher, publish_year, cost_USD, subtitle)
# capitalize column headings
colnames(df_x1) <- str_to_title(colnames(df_x1))
kable(df_x1)
| Title | Author1 | Author2 | Publisher | Publish_year | Cost_usd | Subtitle |
|:------|:--------|:--------|:----------|:-------------|:---------|:---------|
| Data Science for Business | Foster Provost | Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham | Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
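
Both the HTML and XML data frames store the publication year and cost as character strings, as the str output shows. If numeric columns are wanted, a conversion along the following lines should work. This is a minimal sketch in base R: `books_chr` is a hypothetical stand-in that copies only the two relevant columns of df_x.

```r
# Hypothetical stand-in for the two character columns of df_x
books_chr <- data.frame(
    publish_year = c("2013", "2017", "2017"),
    cost_USD     = c("39.99", "39.99", "44.99"),
    stringsAsFactors = FALSE
)

# Convert the year to integer and the cost to numeric
books_typed <- transform(
    books_chr,
    publish_year = as.integer(publish_year),
    cost_USD     = as.numeric(cost_USD)
)
str(books_typed)
```

After this step the column types line up with what fromJSON returns for a JSON file that stores these values as numbers.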

JSON file

Finally, we load the JSON file and create a data frame using the fromJSON function.

# read json file
url_j <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.json"
raw_j <- fromJSON(url_j)
class(raw_j)
## [1] "list"
# this returned a list, so extract first element as the data frame
df_j <- raw_j[[1]]
str(df_j)
## 'data.frame':    3 obs. of  6 variables:
##  $ title       : chr  "Data Science for Business" "R for Data Science" "R for Everyone"
##  $ author      :List of 3
##   ..$ : chr  "Foster Provost" "Tom Fawcett"
##   ..$ : chr  "Hadley Wickham" "Garrett Grolemund"
##   ..$ : chr "Jared P. Lander"
##  $ publisher   : chr  "O'Reilly" "O'Reilly" "Addison Wesley"
##  $ publish_year: int  2013 2017 2017
##  $ cost_USD    : num  40 40 45
##  $ subtitle    : chr  "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"
kable(df_j)
| title | author | publisher | publish_year | cost_USD | subtitle |
|:------|:-------|:----------|-------------:|---------:|:---------|
| Data Science for Business | c("Foster Provost", "Tom Fawcett") | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | c("Hadley Wickham", "Garrett Grolemund") | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |

Notice that for the two books with two authors, the authors now appear as a character vector in the “author” field. This is because in the original JSON file, the two authors are stored as an array under the author key, and fromJSON turns each array into a vector, which makes “author” a list column. We can put the authors in separate fields by extracting the elements of each vector, and then using mutate to create the final data frame.
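
To see the list column arise in isolation, here is a small sketch that parses an inline JSON string. The JSON text, its "books" key, and the placeholder titles and names are all hypothetical and are not taken from books.json; they only mimic the shape described above, with one record holding two authors and another holding one.

```r
library(jsonlite)

# Hypothetical JSON with the assumed shape: one top-level key holding an array of records
json_txt <- '{"books": [
    {"title": "Book A", "author": ["Author One", "Author Two"]},
    {"title": "Book B", "author": ["Author Three"]}
]}'

parsed <- fromJSON(json_txt)
books  <- parsed[[1]]     # the records are simplified into a data frame

class(books$author)       # the author arrays become a list column
lengths(books$author)     # number of authors per book
```
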

# separate the two authors
str(df_j$author)
## List of 3
##  $ : chr [1:2] "Foster Provost" "Tom Fawcett"
##  $ : chr [1:2] "Hadley Wickham" "Garrett Grolemund"
##  $ : chr "Jared P. Lander"
# this is a list of character vectors, so take the first and second element of each
n <- length(df_j$author)
author1 <- character(length = n)
author2 <- character(length = n)
for (k in seq_len(n)) {
    author1[k] <- df_j$author[[k]][1]
    # indexing past the end of a one-author vector yields NA
    author2[k] <- df_j$author[[k]][2]
}
df_j1 <- df_j %>%
    mutate(author1, author2) %>%
    select(title, author1, author2, publisher, publish_year, cost_USD, subtitle)
# capitalize column headings
colnames(df_j1) <- str_to_title(colnames(df_j1))
kable(df_j1)
| Title | Author1 | Author2 | Publisher | Publish_year | Cost_usd | Subtitle |
|:------|:--------|:--------|:----------|-------------:|---------:|:---------|
| Data Science for Business | Foster Provost | Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham | Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | NA | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
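
The explicit loop can also be written without indexing variables. The following is a sketch using vapply from base R; the `authors` list is a hypothetical copy of df_j$author, typed in by hand so the example runs on its own.

```r
# Hypothetical copy of the list column df_j$author
authors <- list(
    c("Foster Provost", "Tom Fawcett"),
    c("Hadley Wickham", "Garrett Grolemund"),
    "Jared P. Lander"
)

# Take the first and second element of each vector;
# a[2] is NA when a book has only one author
author1 <- vapply(authors, function(a) a[1], character(1))
author2 <- vapply(authors, function(a) a[2], character(1))

data.frame(author1, author2, stringsAsFactors = FALSE)
```

vapply checks that every result is a single character value, so a malformed entry would raise an error instead of silently producing a list.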

Conclusions

The three final data frames are very similar but not identical. In particular, they differ in how books with two authors in the “author” field are handled: