XML

The first step is to read in the XML data and then extract the relevant information from its nodes by navigating with XPath.

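library(xml2)   # read_xml(), xml_find_all(), xml_text()
library(dplyr)  # tibble(), bind_rows(), glimpse(), slice(), %>%

xml_file_url <- 'https://raw.githubusercontent.com/pmahdi/cuny-data-607/main/assignment-5-books.xml'
xml_tree <- read_xml(x = xml_file_url)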

all_titles_atomic <- xml_text(
  xml_find_all(
    xml_tree, '/books/book/title'
    )
  )

all_authors_atomic <- xml_text(
  xml_find_all(
    xml_tree, '/books/book/author'
    )
  )

book1_chapters_atomic <- xml_text(
  xml_find_all(
    xml_tree, '/books/book[1]/chapters/chapter'
    )
  )

book2_chapters_atomic <- xml_text(
  xml_find_all(
    xml_tree, '/books/book[2]/chapters/chapter'
    )
  )

book3_chapters_atomic <- xml_text(
  xml_find_all(
    xml_tree, '/books/book[3]/chapters/chapter'
    )
  )

all_pages_atomic <- xml_text(
  xml_find_all(
    xml_tree, '/books/book/pages'
    )
  )

all_formats_atomic <- xml_text(
  xml_find_all(
    xml_tree, '/books/book/format'
    )
  )
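To make the XPath navigation above easier to follow, here is a minimal sketch on a small inline document. The toy XML and every `toy_` name are hypothetical; they only assume the same `/books/book/...` layout that the XPath expressions above imply.

```r
library(xml2)

# Hypothetical two-book document that mirrors the /books/book/... layout
toy_xml <- read_xml('<books>
  <book>
    <title>Book A</title>
    <author>Ann</author>
    <author>Bob</author>
    <chapters><chapter>A1</chapter><chapter>A2</chapter></chapters>
  </book>
  <book>
    <title>Book B</title>
    <author>Cy</author>
    <chapters><chapter>B1</chapter></chapters>
  </book>
</books>')

# An absolute path returns the matches from every book, in document order
toy_titles <- xml_text(xml_find_all(toy_xml, '/books/book/title'))
toy_authors <- xml_text(xml_find_all(toy_xml, '/books/book/author'))

# A positional predicate such as book[1] restricts the match to one book
# (XPath positions start at 1)
toy_book1_chapters <- xml_text(
  xml_find_all(toy_xml, '/books/book[1]/chapters/chapter')
)
```

Note that `toy_authors` comes back as one flat vector, so the link between an author and a book is lost unless it is rebuilt afterwards.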

With the extraction done, what remains is to combine and organize the information in a tidy way. To that end, I will first create a tibble that gives basic information on each of the three books.

xml_books <- tibble(
  title = all_titles_atomic, 
  authors = c(2, 1, 1), 
  chapters = c(length(book1_chapters_atomic), 
               length(book2_chapters_atomic), 
               length(book3_chapters_atomic)
               ), 
  pages = as.numeric(all_pages_atomic), 
  format = all_formats_atomic
  )
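The author counts above are typed in by hand. As a hedged alternative, the counts could be derived from the XML itself by evaluating a relative XPath against each book node. The toy document and the helper `count_children()` below are hypothetical and only illustrate the idea.

```r
library(xml2)

# Hypothetical document with the same layout as the assignment file
toy_xml <- read_xml('<books>
  <book><title>Book A</title><author>Ann</author><author>Bob</author>
    <chapters><chapter>A1</chapter><chapter>A2</chapter></chapters></book>
  <book><title>Book B</title><author>Cy</author>
    <chapters><chapter>B1</chapter></chapters></book>
</books>')

toy_books <- xml_find_all(toy_xml, '/books/book')

# Count the nodes matched by a relative XPath ('./...') inside each book node
count_children <- function(book_nodes, xpath) {
  vapply(book_nodes, function(b) length(xml_find_all(b, xpath)), integer(1))
}

toy_author_counts <- count_children(toy_books, './author')
toy_chapter_counts <- count_children(toy_books, './chapters/chapter')
```

With this approach the summary table would stay correct if a book were added or an author list changed.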

Next, I will create a tibble that pairs the authors with their respective titles.

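# Indices are picked by hand so that each author lines up with the title in the same position below
xml_authors <- tibble(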
  author = c(
    all_authors_atomic[1], 
    all_authors_atomic[4], 
    all_authors_atomic[2], 
    all_authors_atomic[3]
    ), 
  title = c(
    all_titles_atomic[1], 
    all_titles_atomic[3], 
    all_titles_atomic[1], 
    all_titles_atomic[2]
    )
  )

Finally, all that’s left is to create a tibble for pairing titles with their corresponding chapters.

xml_chapters_book1 <- tibble(
  title = all_titles_atomic[1], 
  chapter = book1_chapters_atomic
)

xml_chapters_book2 <- tibble(
  title = all_titles_atomic[2], 
  chapter = book2_chapters_atomic
)

xml_chapters_book3 <- tibble(
  title = all_titles_atomic[3], 
  chapter = book3_chapters_atomic
)

xml_chapters <- bind_rows(xml_chapters_book1, 
                          xml_chapters_book2, 
                          xml_chapters_book3)
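The per-book steps above repeat the same pattern three times. A minimal sketch of one way to generalize it is to loop over the book nodes and pair each title with whatever a relative XPath matches. The helper `pair_with_title()`, its `value_name` argument, and the toy document are all hypothetical.

```r
library(xml2)
library(dplyr)  # tibble(), bind_rows()

toy_xml <- read_xml('<books>
  <book><title>Book A</title><author>Ann</author><author>Bob</author>
    <chapters><chapter>A1</chapter><chapter>A2</chapter></chapters></book>
  <book><title>Book B</title><author>Cy</author>
    <chapters><chapter>B1</chapter></chapters></book>
</books>')

toy_books <- xml_find_all(toy_xml, '/books/book')

# Pair one book's title with every node matched by `xpath`;
# `value_name` sets the name of the second column
pair_with_title <- function(book, xpath, value_name) {
  out <- tibble(
    title = xml_text(xml_find_first(book, './title')),
    value = xml_text(xml_find_all(book, xpath))
  )
  setNames(out, c('title', value_name))
}

toy_authors <- bind_rows(
  lapply(toy_books, pair_with_title, xpath = './author', value_name = 'author')
)
toy_chapters <- bind_rows(
  lapply(toy_books, pair_with_title,
         xpath = './chapters/chapter', value_name = 'chapter')
)
```

Because the title is read from the same node as the authors or chapters, no manual index matching is needed.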

The three tibbles xml_books, xml_authors, and xml_chapters present the data from the original XML file in a tidy way. I have tried to follow the tidy data principles: each column is a variable, each row is an observation, and each cell holds a single value. A book can have several authors and many chapters, so a single table would need either repeated rows or cells holding multiple values; that is why I split the data into three separate tibbles. Let’s have a peek at the final result:

xml_books

## # A tibble: 3 × 5
##   title                                             authors chapt…¹ pages format
##   <chr>                                               <dbl>   <int> <dbl> <chr> 
## 1 The Art of Data Science | A Guide for Anyone Who…       2      11   155 pdf   
## 2 Doing Bayesian Data Analysis | A Tutorial with R…       1      23   829 epub  
## 3 R Programming for Data Science                          1      23   176 pdf   
## # … with abbreviated variable name ¹chapters

xml_authors

## # A tibble: 4 × 2
##   author           title                                                        
##   <chr>            <chr>                                                        
## 1 Roger D. Peng    The Art of Data Science | A Guide for Anyone Who Works with …
## 2 Roger D. Peng    R Programming for Data Science                               
## 3 Elizabeth Matsui The Art of Data Science | A Guide for Anyone Who Works with …
## 4 John K. Kruschke Doing Bayesian Data Analysis | A Tutorial with R and BUGS

glimpse(xml_chapters)

## Rows: 57
## Columns: 2
## $ title   <chr> "The Art of Data Science | A Guide for Anyone Who Works with D…
## $ chapter <chr> "Data Analysis as Art", "Epicycles of Analysis", "Stating and …

# Showing the first and last 5 rows of xml_chapters
xml_chapters %>% 
  slice(c(head(row_number(), 5), tail(row_number(), 5)))

## # A tibble: 10 × 2
##    title                                                            chapter     
##    <chr>                                                            <chr>       
##  1 The Art of Data Science | A Guide for Anyone Who Works with Data Data Analys…
##  2 The Art of Data Science | A Guide for Anyone Who Works with Data Epicycles o…
##  3 The Art of Data Science | A Guide for Anyone Who Works with Data Stating and…
##  4 The Art of Data Science | A Guide for Anyone Who Works with Data Exploratory…
##  5 The Art of Data Science | A Guide for Anyone Who Works with Data Using Model…
##  6 R Programming for Data Science                                   Profiling R…
##  7 R Programming for Data Science                                   Simulation  
##  8 R Programming for Data Science                                   Data Analys…
##  9 R Programming for Data Science                                   Parallel Co…
## 10 R Programming for Data Science                                   Why I Inden…
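Because all three tibbles share the title column, they can be recombined or cross-checked with joins. As a minimal sketch on hypothetical toy data (`toy_books`, `toy_authors`, and the `authors_derived` column are illustrative names, not part of the assignment), hand-typed author counts can be compared against counts derived from the author table:

```r
library(dplyr)

# Hypothetical stand-ins for the book summary and the author pairing tables
toy_books <- tibble(title = c('Book A', 'Book B'), authors = c(2, 1))
toy_authors <- tibble(
  author = c('Ann', 'Cy', 'Bob'),
  title = c('Book A', 'Book B', 'Book A')
)

# Count author rows per title, then line the counts up with the summary table
derived_counts <- count(toy_authors, title, name = 'authors_derived')
toy_check <- left_join(toy_books, derived_counts, by = 'title')

# TRUE when every hand-typed count agrees with the derived one
counts_agree <- all(toy_check$authors == toy_check$authors_derived)
```

The same join on title would attach chapters to books, which is the main payoff of keeping one kind of observation per table.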

JSON

First, the JSON file has to be read into R.

library(rjson)  # fromJSON(file = ...)

json_file_url <- 'https://raw.githubusercontent.com/pmahdi/cuny-data-607/main/assignment-5-books.json'
json_list <- fromJSON(file = json_file_url)

As seen previously, the information I have chosen to include about my books requires me to create multiple tibbles to present the data in a tidy way. However, it is possible to organize the information in a single data structure if the tidy philosophy is disregarded. For example:

json_dirty <- sapply(json_list, `[`)
knitr::kable(json_dirty)

	book_1	book_2	book_3
title	The Art of Data Science | A Guide for Anyone Who Works with Data	Doing Bayesian Data Analysis | A Tutorial with R and BUGS	R Programming for Data Science
author	c(“Roger D. Peng”, “Elizabeth Matsui”)	John K. Kruschke	Roger D. Peng
chapters	c(“Data Analysis as Art”, “Epicycles of Analysis”, “Stating and Refining the Question”, “Exploratory Data Analysis”, “Using Models to Explore Your Data”, “Inference: A Primer”, “Formal Modeling”, “Inference vs. Prediction: Implications for Modeling Strategy”, “Interpreting Your Results”, “Communication”, “Concluding Thoughts”)	c(“This Book’s Organization”, “Introduction”, “What is This Stuff Called Probability”, “Bayes’ Rule”, “Inferring a Binomial Proportion via Exact Mathematical Analysis”, “Inferring a Binomial Proportion via Grid Approximation”, “Inferring a Binomial Proportion via the Metropolis Algorithm”, “Inferring Two Binomial Proportions via Gibbs Sampling”, “Bernoulli Likelihood with Hierarchical Prior”, “Hierarchical Modeling and Model Comparison”, “Null Hypothesis Significance Testing”, “Bayesian Approaches to Testing a Point (Null) Hypothesis”, “Goals, Power, and Sample Size”, “Overview of the Generalized Linear Model”, “Metric Predicted Variable on a Single Group”, “Metric Predicted Variable with One Metric Predictor”, “Metric Predicted Variable with Multiple Metric Predictors”, “Metric Predicted Variable with One Nominal Predictor”, “Metric Predicted Variable with Multiple Nominal Predictors”, “Dichotomous Predicted Variable”, “Ordinal Predicted Variable”, “Contingency Table Analysis”, “Tools in the Trunk”)	c(“History and Overview of R”, “Getting Started with R”, “R Nuts and Bolts”, “Getting Data In and Out of R”, “Using the readr Package”, “Using Textual and Binary Formats for Storing Data”, “Interfaces to the Outside World”, “Subsetting R Objects”, “Vectorized Operations”, “Dates and Times”, “Managing Data Frames with the dplyr package”, “Control Structures”, “Functions”, “Scoping Rules of R”, “Coding Standards for R”, “Loop Functions”, “Regular Expressions”, “Debugging”, “Profiling R Code”, “Simulation”, “Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.”, “Parallel Computation”, “Why I Indent My Code 8 Spaces”)
pages	155	829	176
format	pdf	epub	pdf

This layout is hard to read and it breaks the tidy principles, because the author and chapters cells hold whole vectors rather than single values. So I will now tidy the data step by step, as I did in the XML section.

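# json_dirty is a 5 x 3 list-matrix filled column by column, so positions 1, 6, and 11 hold the three titles
# Getting all the titles in one character vector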
json_title_atomic <- character()
for (i in c(1, 6, 11)) {
  json_title_atomic <- append(x = json_title_atomic, values = json_dirty[[i]])
}
# Getting all the authors in one character vector
json_author_atomic <- character()
for (i in c(2, 7, 12)) {
  json_author_atomic <- 
    append(x = json_author_atomic, values = json_dirty[[i]])
}
# Swap elements 2 and 4 so the author order matches the titles assigned in json_authors
json_author_atomic[c(2, 4)] <- json_author_atomic[c(4, 2)]
# Extracting the chapters of each book separately, then matching them with their corresponding title
json_book_1_chapters <- tibble(title = json_title_atomic[1], 
                               chapter = json_dirty[[3]])
json_book_2_chapters <- tibble(title = json_title_atomic[2], 
                               chapter = json_dirty[[8]])
json_book_3_chapters <- tibble(title = json_title_atomic[3], 
                               chapter = json_dirty[[13]])

# Getting all the page counts in one numeric vector
json_pages_atomic <- numeric()
for (i in c(4, 9, 14)) {
  json_pages_atomic <- 
    append(x = json_pages_atomic, values = json_dirty[[i]])
}
# Getting all the publication formats in one character vector
json_format_atomic <- character()
for (i in c(5, 10, 15)) {
  json_format_atomic <- 
    append(x = json_format_atomic, values = json_dirty[[i]])
}
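The integer positions used above depend on how R stores a matrix: column by column. The sketch below, built on a hypothetical stand-in for `json_list` (all `toy_` names and values are made up), shows that positional indexing and name-based indexing reach the same cell, and that whole rows can be pulled by field name instead of by counting positions.

```r
# Hypothetical stand-in for json_list: one named entry per book
toy_list <- list(
  book_1 = list(title = 'Book A', author = c('Ann', 'Bob'), pages = 10),
  book_2 = list(title = 'Book B', author = 'Cy', pages = 20)
)

# Same reshaping as json_dirty: a list-matrix with fields as rows, books as columns
toy_dirty <- sapply(toy_list, `[`)

# With 3 rows, position 4 is row 1 of column 2, i.e. the second book's title
title_by_position <- toy_dirty[[4]]
title_by_name <- toy_dirty[['title', 'book_2']]

# Whole rows can be selected by field name; unlist() flattens the per-book pieces
toy_titles <- unlist(toy_dirty['title', ], use.names = FALSE)
toy_authors <- unlist(toy_dirty['author', ], use.names = FALSE)
toy_pages <- unlist(toy_dirty['pages', ], use.names = FALSE)
```

Name-based indexing keeps working if a field is added to the file, whereas hard-coded positions would silently shift.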

With all the necessary pieces parsed, it’s time to recombine them to create nice and tidy tibbles.

json_books <- tibble(
  title = json_title_atomic, 
  authors = c(2, 1, 1), 
  chapters = c(length(json_dirty[[3]]), length(json_dirty[[8]]), 
               length(json_dirty[[13]])), 
  pages = json_pages_atomic, 
  format = json_format_atomic
)

json_authors <- tibble(
  author = json_author_atomic, 
  title = c(json_title_atomic[1], json_title_atomic[3], json_title_atomic[2], 
            json_title_atomic[1])
)

json_chapters <- bind_rows(json_book_1_chapters, 
                           json_book_2_chapters, 
                           json_book_3_chapters)
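Pairing authors with titles above relies on reordering vectors by hand. One possible alternative, sketched here on a hypothetical stand-in for `json_list`, is to repeat each title once per author with `rep()` and `lengths()`, so the two columns line up by construction.

```r
library(dplyr)  # tibble()

# Hypothetical stand-in for json_list
toy_list <- list(
  book_1 = list(title = 'Book A', author = c('Ann', 'Bob')),
  book_2 = list(title = 'Book B', author = 'Cy')
)

# One character vector of authors per book, and one title per book
author_list <- lapply(toy_list, `[[`, 'author')
title_vec <- vapply(toy_list, `[[`, character(1), 'title')

# Repeat each title as many times as its book has authors
toy_authors <- tibble(
  author = unlist(author_list, use.names = FALSE),
  title = rep(unname(title_vec), times = lengths(author_list))
)
```

The same pattern would work for chapters by swapping `'author'` for `'chapters'`.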

json_books

## # A tibble: 3 × 5
##   title                                             authors chapt…¹ pages format
##   <chr>                                               <dbl>   <int> <dbl> <chr> 
## 1 The Art of Data Science | A Guide for Anyone Who…       2      11   155 pdf   
## 2 Doing Bayesian Data Analysis | A Tutorial with R…       1      23   829 epub  
## 3 R Programming for Data Science                          1      23   176 pdf   
## # … with abbreviated variable name ¹chapters

json_authors

## # A tibble: 4 × 2
##   author           title                                                        
##   <chr>            <chr>                                                        
## 1 Roger D. Peng    The Art of Data Science | A Guide for Anyone Who Works with …
## 2 Roger D. Peng    R Programming for Data Science                               
## 3 John K. Kruschke Doing Bayesian Data Analysis | A Tutorial with R and BUGS    
## 4 Elizabeth Matsui The Art of Data Science | A Guide for Anyone Who Works with …

glimpse(json_chapters)

## Rows: 57
## Columns: 2
## $ title   <chr> "The Art of Data Science | A Guide for Anyone Who Works with D…
## $ chapter <chr> "Data Analysis as Art", "Epicycles of Analysis", "Stating and …

# Showing the first and last 5 rows of json_chapters
json_chapters %>%
  slice(c(head(row_number(), 5), tail(row_number(), 5)))

## # A tibble: 10 × 2
##    title                                                            chapter     
##    <chr>                                                            <chr>       
##  1 The Art of Data Science | A Guide for Anyone Who Works with Data Data Analys…
##  2 The Art of Data Science | A Guide for Anyone Who Works with Data Epicycles o…
##  3 The Art of Data Science | A Guide for Anyone Who Works with Data Stating and…
##  4 The Art of Data Science | A Guide for Anyone Who Works with Data Exploratory…
##  5 The Art of Data Science | A Guide for Anyone Who Works with Data Using Model…
##  6 R Programming for Data Science                                   Profiling R…
##  7 R Programming for Data Science                                   Simulation  
##  8 R Programming for Data Science                                   Data Analys…
##  9 R Programming for Data Science                                   Parallel Co…
## 10 R Programming for Data Science                                   Why I Inden…

HTML

Reading the HTML data into R and creating tidy data structures is far easier than doing so with either JSON or XML, because the file already stores the data as three tables. The htmltab package reads an HTML table straight into a data frame, and I plan to use it, even though doing so almost feels like cheating because it’s so easy.

library(htmltab)  # htmltab()

html_file_url <- 'https://raw.githubusercontent.com/pmahdi/cuny-data-607/main/assignment-5-books.html'
html_books <- as_tibble(htmltab(doc = html_file_url, which = 1))
html_authors <- as_tibble(htmltab(doc = html_file_url, which = 2))
html_chapters <- as_tibble(htmltab(doc = html_file_url, which = 3))

# Adding some extra information to html_books and shuffling the column order of html_authors
html_books <- html_books %>% 
  mutate(authors = c(2, 1, 1), chapters = c(11, 23, 23), .after = title)
html_authors[c(1, 2)] <- html_authors[c(2, 1)]
colnames(html_authors) <- c('author', 'title')

html_books

## # A tibble: 3 × 5
##   title                                             authors chapt…¹ pages format
##   <chr>                                               <dbl>   <dbl> <chr> <chr> 
## 1 The Art of Data Science | A Guide for Anyone Who…       2      11 155   pdf   
## 2 Doing Bayesian Data Analysis | A Tutorial with R…       1      23 829   epub  
## 3 R Programming for Data Science                          1      23 176   pdf   
## # … with abbreviated variable name ¹chapters

html_authors

## # A tibble: 4 × 2
##   author           title                                                        
##   <chr>            <chr>                                                        
## 1 Roger D. Peng    The Art of Data Science | A Guide for Anyone Who Works with …
## 2 Elizabeth Matsui The Art of Data Science | A Guide for Anyone Who Works with …
## 3 John K. Kruschke Doing Bayesian Data Analysis | A Tutorial with R and BUGS    
## 4 Roger D. Peng    R Programming for Data Science

glimpse(html_chapters)

## Rows: 57
## Columns: 2
## $ title   <chr> "The Art of Data Science | A Guide for Anyone Who Works with D…
## $ chapter <chr> "Data Analysis as Art", "Epicycles of Analysis", "Stating and …

# Showing the first and last 5 rows of html_chapters
html_chapters %>%
  slice(c(head(row_number(), 5), tail(row_number(), 5)))

## # A tibble: 10 × 2
##    title                                                            chapter     
##    <chr>                                                            <chr>       
##  1 The Art of Data Science | A Guide for Anyone Who Works with Data Data Analys…
##  2 The Art of Data Science | A Guide for Anyone Who Works with Data Epicycles o…
##  3 The Art of Data Science | A Guide for Anyone Who Works with Data Stating and…
##  4 The Art of Data Science | A Guide for Anyone Who Works with Data Exploratory…
##  5 The Art of Data Science | A Guide for Anyone Who Works with Data Using Model…
##  6 R Programming for Data Science                                   Profiling R…
##  7 R Programming for Data Science                                   Simulation  
##  8 R Programming for Data Science                                   Data Analys…
##  9 R Programming for Data Science                                   Parallel Co…
## 10 R Programming for Data Science                                   Why I Inden…
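In the html_books output above, the pages column prints as `<chr>`, whereas the XML and JSON versions print it as `<dbl>`. A minimal sketch of how the HTML result could be brought in line, using hypothetical toy tables rather than the real scraped data, converts the column and derives the author counts from the author table instead of typing them in:

```r
library(dplyr)

# Hypothetical stand-ins for tables read from HTML, where every cell is text
toy_html_books <- tibble(
  title = c('Book A', 'Book B'),
  pages = c('155', '829'),
  format = c('pdf', 'epub')
)
toy_html_authors <- tibble(
  title = c('Book A', 'Book A', 'Book B'),
  author = c('Ann', 'Bob', 'Cy')
)

toy_html_books <- toy_html_books %>%
  mutate(pages = as.numeric(pages)) %>%                 # text to numbers
  left_join(count(toy_html_authors, title, name = 'authors'),
            by = 'title') %>%                           # derived author counts
  relocate(authors, .after = title)                     # match the column order used above
```

After this step the toy table has the same column order as the other formats, with pages stored as a number.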

Assignment 5: Working with XML and JSON in R

Prinon Mahdi
