For this assignment, I chose three books from our assigned or suggested reading and created HTML, XML, and JSON files containing the following fields: title, author(s), publisher, publish year, cost (USD), and subtitle.
The three files are saved on GitHub.
To start, we load several libraries that we will use below.
# load libraries
library(tidyverse)
library(knitr)
library(RCurl)
library(XML)
library(jsonlite)
First, let’s load the HTML file and create a data frame using the readHTMLTable function.
# read and parse html file
url_h <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.html"
# htmlParse cannot fetch https:// URLs itself (libxml2 has no HTTPS support),
# so download the page text with getURL first and parse that
raw_h <- htmlParse(getURL(url_h))
class(raw_h)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
# read html table
raw_h1 <- readHTMLTable(raw_h, stringsAsFactors = FALSE)
class(raw_h1)
## [1] "list"
# this returned a list, so extract first element as the data frame
df_h <- raw_h1[[1]]
str(df_h)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Data Science for Business" "R for Data Science" "R for Everyone"
## $ Author : chr "Foster Provost & Tom Fawcett" "Hadley Wickham & Garrett Grolemund" "Jared P. Lander"
## $ Publisher : chr "O'Reilly" "O'Reilly" "Addison Wesley"
## $ Publish Year: chr "2013" "2017" "2017"
## $ Cost USD : chr "39.99" "39.99" "44.99"
## $ Subtitle : chr "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"
kable(df_h)
| Title | Author | Publisher | Publish Year | Cost USD | Subtitle |
|---|---|---|---|---|---|
| Data Science for Business | Foster Provost & Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham & Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
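The list-then-extract pattern above can be tried without a network connection. The following is a minimal sketch, assuming a small made-up table in place of the real books.html (the `html_text` string and its two columns are illustrative only, not the contents of the file on GitHub):

```r
library(XML)

# Hypothetical stand-in for books.html: one table with a header row
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Publisher</th></tr>",
  "<tr><td>R for Data Science</td><td>O'Reilly</td></tr>",
  "<tr><td>R for Everyone</td><td>Addison Wesley</td></tr>",
  "</table></body></html>"
)

# asText = TRUE tells htmlParse the argument is markup, not a file name or URL
doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable returns one list element per <table> found in the document
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
length(tables)

# The page has a single table, so the data frame is the first element
books <- tables[[1]]
str(books)
```

Because the function returns a list even when the page holds only one table, the `[[1]]` step is needed regardless of how simple the HTML is.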
Notice that for the two books with two authors, the two authors show up in the same “Author” field, separated by an “&”. To separate the authors, we use str_split and mutate to create the final data frame.
# separate the two authors
temp <- str_split(df_h$Author, "&", simplify = TRUE)
temp
## [,1] [,2]
## [1,] "Foster Provost " " Tom Fawcett"
## [2,] "Hadley Wickham " " Garrett Grolemund"
## [3,] "Jared P. Lander" ""
# final data frame
df_h1 <- df_h %>% mutate(Author1 = temp[ , 1], Author2 = temp[ , 2]) %>% select(Title, Author1, Author2, Publisher, "Publish Year", "Cost USD", Subtitle)
kable(df_h1)
| Title | Author1 | Author2 | Publisher | Publish Year | Cost USD | Subtitle |
|---|---|---|---|---|---|---|
| Data Science for Business | Foster Provost | Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham | Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
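Splitting on a bare "&" leaves the surrounding spaces attached to each name, as the printed matrix shows, and pads single-author rows with an empty string. One possible variation, sketched here on a hard-coded copy of the author strings rather than on `df_h` itself, is to let the split pattern absorb the whitespace and then recode the empty string as `NA`:

```r
library(stringr)

# Author strings copied from the table above
authors <- c("Foster Provost & Tom Fawcett",
             "Hadley Wickham & Garrett Grolemund",
             "Jared P. Lander")

# Split on "&" together with any whitespace on either side of it
parts <- str_split(authors, "\\s*&\\s*", simplify = TRUE)

author1 <- parts[, 1]
author2 <- parts[, 2]

# simplify = TRUE pads missing pieces with ""; mark them as missing instead
author2[author2 == ""] <- NA

author1
author2
```

Whether an empty string or `NA` is the better marker for "no second author" depends on the downstream use; the sketch only shows that either is easy to produce.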
Next, let’s load the XML file and create a data frame using the xmlToDataFrame function.
# read and parse xml file
url_x <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.xml"
# same as above, can't use xmlParse directly on the URL; so use getURL
raw_x <- xmlParse(getURL(url_x))
class(raw_x)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
# transform xml object into data frame
df_x <- xmlToDataFrame(raw_x, stringsAsFactors = FALSE)
str(df_x)
## 'data.frame': 3 obs. of 6 variables:
## $ title : chr "Data Science for Business" "R for Data Science" "R for Everyone"
## $ author : chr "Foster Provost & Tom Fawcett" "Hadley Wickham & Garrett Grolemund" "Jared P. Lander"
## $ publisher : chr "O'Reilly" "O'Reilly" "Addison Wesley"
## $ publish_year: chr "2013" "2017" "2017"
## $ cost_USD : chr "39.99" "39.99" "44.99"
## $ subtitle : chr "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"
kable(df_x)
| title | author | publisher | publish_year | cost_USD | subtitle |
|---|---|---|---|---|---|
| Data Science for Business | Foster Provost & Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham & Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
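As the `str` output shows, `xmlToDataFrame` returns every column as character, including the year and the cost. A minimal sketch of converting those columns, assuming a small inline document with the same element names as above (the `xml_text` string is a made-up stand-in, not the real books.xml):

```r
library(XML)

# Hypothetical stand-in for books.xml: one <book> element per row
xml_text <- paste0(
  "<books>",
  "<book><title>R for Data Science</title>",
  "<publish_year>2017</publish_year><cost_USD>39.99</cost_USD></book>",
  "<book><title>R for Everyone</title>",
  "<publish_year>2017</publish_year><cost_USD>44.99</cost_USD></book>",
  "</books>"
)

# asText = TRUE parses the string itself rather than treating it as a path
doc <- xmlParse(xml_text, asText = TRUE)
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# XML carries no type information, so convert the numeric fields explicitly
books$publish_year <- as.integer(books$publish_year)
books$cost_USD <- as.numeric(books$cost_USD)

str(books)
```

The same two conversions would apply to the HTML-derived data frame, where the corresponding columns are named "Publish Year" and "Cost USD".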
As before, we can use str_split to separate the two authors into separate fields, and then mutate to create the final data frame.
# separate the two authors
temp <- str_split(df_x$author, "&", simplify = TRUE)
df_x1 <- df_x %>% mutate(author1 = temp[ , 1], author2 = temp[ , 2]) %>% select(title, author1, author2, publisher, publish_year, cost_USD, subtitle)
# capitalize column headings
colnames(df_x1) <- str_to_title(colnames(df_x1))
kable(df_x1)
| Title | Author1 | Author2 | Publisher | Publish_year | Cost_usd | Subtitle |
|---|---|---|---|---|---|---|
| Data Science for Business | Foster Provost | Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham | Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
Finally, we load the JSON file and create a data frame using the fromJSON function.
# read json file
url_j <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Wk7/master/books.json"
raw_j <- fromJSON(url_j)
class(raw_j)
## [1] "list"
# this returned a list, so extract first element as the data frame
df_j <- raw_j[[1]]
str(df_j)
## 'data.frame': 3 obs. of 6 variables:
## $ title : chr "Data Science for Business" "R for Data Science" "R for Everyone"
## $ author :List of 3
## ..$ : chr "Foster Provost" "Tom Fawcett"
## ..$ : chr "Hadley Wickham" "Garrett Grolemund"
## ..$ : chr "Jared P. Lander"
## $ publisher : chr "O'Reilly" "O'Reilly" "Addison Wesley"
## $ publish_year: int 2013 2017 2017
## $ cost_USD : num 40 40 45
## $ subtitle : chr "What You Need to Know About Data Mining and Data-Analytic Thinking" "Import, Tidy, Transform, Visualize, and Model Data" "Advanced Analytics and Graphics"
kable(df_j)
| title | author | publisher | publish_year | cost_USD | subtitle |
|---|---|---|---|---|---|
| Data Science for Business | c(“Foster Provost”, “Tom Fawcett”) | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | c(“Hadley Wickham”, “Garrett Grolemund”) | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
Notice that for the two books with two authors, the authors now show up as a character vector in the “author” field. This is because in the original JSON file, the authors are stored as an array under the author key, which fromJSON turns into a list column. We can separate the two authors into separate fields by indexing into each character vector, and then using mutate to create the final data frame.
# separate the two authors
str(df_j$author)
## List of 3
## $ : chr [1:2] "Foster Provost" "Tom Fawcett"
## $ : chr [1:2] "Hadley Wickham" "Garrett Grolemund"
## $ : chr "Jared P. Lander"
# this is a list of vectors, so need to split
n <- length(df_j$author)
author1 <- character(length = n)
author2 <- character(length = n)
for (k in seq_len(n)) {
  # indexing past the end of a vector returns NA, so single-author books get NA
  author1[k] <- df_j$author[[k]][1]
  author2[k] <- df_j$author[[k]][2]
}
df_j1 <- df_j %>% mutate(author1, author2) %>% select(title, author1, author2, publisher, publish_year, cost_USD, subtitle)
# capitalize column headings
colnames(df_j1) <- str_to_title(colnames(df_j1))
kable(df_j1)
| Title | Author1 | Author2 | Publisher | Publish_year | Cost_usd | Subtitle |
|---|---|---|---|---|---|---|
| Data Science for Business | Foster Provost | Tom Fawcett | O’Reilly | 2013 | 39.99 | What You Need to Know About Data Mining and Data-Analytic Thinking |
| R for Data Science | Hadley Wickham | Garrett Grolemund | O’Reilly | 2017 | 39.99 | Import, Tidy, Transform, Visualize, and Model Data |
| R for Everyone | Jared P. Lander | NA | Addison Wesley | 2017 | 44.99 | Advanced Analytics and Graphics |
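The loop above can also be written without explicit indexing. The following is a sketch only, assuming a small inline JSON string whose top-level key `books` and two records are made up for illustration, and using a hypothetical helper `nth_author` that is not part of the original analysis:

```r
library(jsonlite)

# Hypothetical stand-in for books.json: authors stored as arrays
json_text <- '{"books": [
  {"title": "R for Data Science",
   "author": ["Hadley Wickham", "Garrett Grolemund"]},
  {"title": "R for Everyone",
   "author": ["Jared P. Lander"]}
]}'

# fromJSON returns a list; the data frame sits under the single top-level key
books <- fromJSON(json_text)[[1]]

# Arrays of unequal length become a list column of character vectors
str(books$author)

# Hypothetical helper: take the i-th author of each book.
# Indexing past the end of a vector yields NA, which marks a missing author.
nth_author <- function(authors, i) {
  vapply(authors, function(a) a[i], character(1))
}

books$author1 <- nth_author(books$author, 1)
books$author2 <- nth_author(books$author, 2)
books$author <- NULL

books
```

`vapply` with `character(1)` checks that every book contributes exactly one string per call, so a malformed record would raise an error instead of silently producing a ragged result.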
The three final data frames are very similar but not identical. In particular, they differ in how multiple authors in the “author” field are handled: