The first step is to read in the XML data and then extract the relevant information from the various nodes by navigating with XPath.
xml_file_url <- 'https://raw.githubusercontent.com/pmahdi/cuny-data-607/main/assignment-5-books.xml'
xml_tree <- read_xml(x = xml_file_url)
all_titles_atomic <- xml_text(
  xml_find_all(xml_tree, '/books/book/title')
)
all_authors_atomic <- xml_text(
  xml_find_all(xml_tree, '/books/book/author')
)
book1_chapters_atomic <- xml_text(
  xml_find_all(xml_tree, '/books/book[1]/chapters/chapter')
)
book2_chapters_atomic <- xml_text(
  xml_find_all(xml_tree, '/books/book[2]/chapters/chapter')
)
book3_chapters_atomic <- xml_text(
  xml_find_all(xml_tree, '/books/book[3]/chapters/chapter')
)
all_pages_atomic <- xml_text(
  xml_find_all(xml_tree, '/books/book/pages')
)
all_formats_atomic <- xml_text(
  xml_find_all(xml_tree, '/books/book/format')
)
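The three near-identical chapter queries above can also be written as one loop over the book nodes. The following is only a minimal sketch, not the approach used in this write-up: the inline XML is a made-up miniature that assumes the same `/books/book/chapters/chapter` layout, and `sample_tree`, `book_nodes`, and `chapters_by_book` are illustrative names.

```r
library(xml2)

# Hypothetical miniature of the books file, inlined so the sketch runs offline.
sample_tree <- read_xml(paste0(
  '<books>',
  '<book><title>Book A</title>',
  '<chapters><chapter>Intro</chapter><chapter>Methods</chapter></chapters></book>',
  '<book><title>Book B</title>',
  '<chapters><chapter>Only Chapter</chapter></chapters></book>',
  '</books>'
))

# One node per <book>, in document order.
book_nodes <- xml_find_all(sample_tree, '/books/book')

# A relative XPath (leading ".") is evaluated against each book node in turn.
chapters_by_book <- lapply(book_nodes, function(book) {
  xml_text(xml_find_all(book, './chapters/chapter'))
})
names(chapters_by_book) <- xml_text(xml_find_first(book_nodes, './title'))

# Number of chapters found for each title.
lengths(chapters_by_book)
```

Because the loop never mentions a book position, it should keep working if books are added to or removed from the file.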
With the messy part complete, what remains is combining and organizing the information in a tidy way. To that end, I will first create a tibble that gives basic information on each of the three titles.
xml_books <- tibble(
  title = all_titles_atomic,
  authors = c(2, 1, 1),
  chapters = c(
    length(book1_chapters_atomic),
    length(book2_chapters_atomic),
    length(book3_chapters_atomic)
  ),
  pages = as.numeric(all_pages_atomic),
  format = all_formats_atomic
)
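The author counts above are typed in by hand. As a hedged alternative, the counts could be derived from the tree itself. This sketch assumes a made-up miniature document with the same element names; `count_under` is a hypothetical helper written for this illustration, not an xml2 function.

```r
library(xml2)
library(tibble)

# Hypothetical miniature of the books file, inlined so the sketch runs offline.
sample_tree <- read_xml(paste0(
  '<books>',
  '<book><title>Book A</title>',
  '<author>Author One</author><author>Author Two</author>',
  '<chapters><chapter>Intro</chapter><chapter>Methods</chapter></chapters>',
  '<pages>120</pages><format>pdf</format></book>',
  '<book><title>Book B</title>',
  '<author>Author One</author>',
  '<chapters><chapter>Only Chapter</chapter></chapters>',
  '<pages>80</pages><format>epub</format></book>',
  '</books>'
))
book_nodes <- xml_find_all(sample_tree, '/books/book')

# Hypothetical helper: count the nodes matching a relative XPath under each node.
count_under <- function(nodes, xpath) {
  vapply(nodes, function(node) length(xml_find_all(node, xpath)), integer(1))
}

sample_books <- tibble(
  title    = xml_text(xml_find_first(book_nodes, './title')),
  authors  = count_under(book_nodes, './author'),
  chapters = count_under(book_nodes, './chapters/chapter'),
  pages    = as.numeric(xml_text(xml_find_first(book_nodes, './pages'))),
  format   = xml_text(xml_find_first(book_nodes, './format'))
)
sample_books
```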
Next, I will create a tibble that pairs the authors with their respective titles.
xml_authors <- tibble(
  author = c(
    all_authors_atomic[1],
    all_authors_atomic[4],
    all_authors_atomic[2],
    all_authors_atomic[3]
  ),
  title = c(
    all_titles_atomic[1],
    all_titles_atomic[3],
    all_titles_atomic[1],
    all_titles_atomic[2]
  )
)
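Pairing authors with titles by position works for three books but is easy to get wrong as the file grows. One possible alternative, sketched here on a made-up miniature document, is to let each author node look up the title of its own enclosing book with a relative XPath. The names `sample_tree`, `author_nodes`, and `sample_authors` are illustrative.

```r
library(xml2)
library(tibble)

# Hypothetical miniature of the books file, inlined so the sketch runs offline.
sample_tree <- read_xml(paste0(
  '<books>',
  '<book><title>Book A</title>',
  '<author>Author One</author><author>Author Two</author></book>',
  '<book><title>Book B</title>',
  '<author>Author One</author></book>',
  '</books>'
))

author_nodes <- xml_find_all(sample_tree, '/books/book/author')

sample_authors <- tibble(
  author = xml_text(author_nodes),
  # "../title" steps up to the enclosing <book>, then down to its <title>.
  # xml_find_first() returns one result per input node, so the columns line up.
  title  = xml_text(xml_find_first(author_nodes, '../title'))
)
sample_authors
```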
Finally, all that’s left is to create a tibble for pairing titles with their corresponding chapters.
xml_chapters_book1 <- tibble(
  title = all_titles_atomic[1],
  chapter = book1_chapters_atomic
)
xml_chapters_book2 <- tibble(
  title = all_titles_atomic[2],
  chapter = book2_chapters_atomic
)
xml_chapters_book3 <- tibble(
  title = all_titles_atomic[3],
  chapter = book3_chapters_atomic
)
xml_chapters <- bind_rows(
  xml_chapters_book1,
  xml_chapters_book2,
  xml_chapters_book3
)
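The per-book tibbles above could also be produced in a single pass. This is a rough sketch under the assumption of the same element layout, run on a made-up miniature document: it builds one small tibble per book node and stacks them with `bind_rows()`, which accepts a list of data frames.

```r
library(xml2)
library(tibble)
library(dplyr)

# Hypothetical miniature of the books file, inlined so the sketch runs offline.
sample_tree <- read_xml(paste0(
  '<books>',
  '<book><title>Book A</title>',
  '<chapters><chapter>Intro</chapter><chapter>Methods</chapter></chapters></book>',
  '<book><title>Book B</title>',
  '<chapters><chapter>Only Chapter</chapter></chapters></book>',
  '</books>'
))
book_nodes <- xml_find_all(sample_tree, '/books/book')

# One tibble per book; the length-1 title is recycled across that book's chapters.
sample_chapters <- bind_rows(lapply(book_nodes, function(book) {
  tibble(
    title   = xml_text(xml_find_first(book, './title')),
    chapter = xml_text(xml_find_all(book, './chapters/chapter'))
  )
}))
sample_chapters
```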
The three tibbles xml_books, xml_authors,
and xml_chapters present the data stored in the original
xml file in a tidy way. I have tried to stick to the philosophy of
having each column as a variable, each row as an observation, and each
cell as a single value. That is why I found it necessary to split
the data into three separate tibbles. Let’s have a peek at the final
result:
xml_books
## # A tibble: 3 × 5
## title authors chapt…¹ pages format
## <chr> <dbl> <int> <dbl> <chr>
## 1 The Art of Data Science | A Guide for Anyone Who… 2 11 155 pdf
## 2 Doing Bayesian Data Analysis | A Tutorial with R… 1 23 829 epub
## 3 R Programming for Data Science 1 23 176 pdf
## # … with abbreviated variable name ¹chapters
xml_authors
## # A tibble: 4 × 2
## author title
## <chr> <chr>
## 1 Roger D. Peng The Art of Data Science | A Guide for Anyone Who Works with …
## 2 Roger D. Peng R Programming for Data Science
## 3 Elizabeth Matsui The Art of Data Science | A Guide for Anyone Who Works with …
## 4 John K. Kruschke Doing Bayesian Data Analysis | A Tutorial with R and BUGS
glimpse(xml_chapters)
## Rows: 57
## Columns: 2
## $ title <chr> "The Art of Data Science | A Guide for Anyone Who Works with D…
## $ chapter <chr> "Data Analysis as Art", "Epicycles of Analysis", "Stating and …
# Showing the first and last 5 rows of xml_chapters
xml_chapters %>%
  slice(c(head(row_number(), 5), tail(row_number(), 5)))
## # A tibble: 10 × 2
## title chapter
## <chr> <chr>
## 1 The Art of Data Science | A Guide for Anyone Who Works with Data Data Analys…
## 2 The Art of Data Science | A Guide for Anyone Who Works with Data Epicycles o…
## 3 The Art of Data Science | A Guide for Anyone Who Works with Data Stating and…
## 4 The Art of Data Science | A Guide for Anyone Who Works with Data Exploratory…
## 5 The Art of Data Science | A Guide for Anyone Who Works with Data Using Model…
## 6 R Programming for Data Science Profiling R…
## 7 R Programming for Data Science Simulation
## 8 R Programming for Data Science Data Analys…
## 9 R Programming for Data Science Parallel Co…
## 10 R Programming for Data Science Why I Inden…
First and foremost, the JSON file has to be read into the R environment.
json_file_url <- 'https://raw.githubusercontent.com/pmahdi/cuny-data-607/main/assignment-5-books.json'
json_list <- fromJSON(file = json_file_url)
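The `file` argument used above matches the signature of `fromJSON()` in the rjson package, so this sketch assumes rjson. It parses a small made-up JSON string (not the real file) to show the shape of the result: a named list with one entry per book, where a JSON array of strings comes back as a character vector.

```r
library(rjson)

# Hypothetical miniature of the books file, inlined so the sketch runs offline.
sample_json <- '{
  "book_1": {"title": "Book A", "author": ["Author One", "Author Two"], "pages": 120},
  "book_2": {"title": "Book B", "author": "Author One", "pages": 80}
}'

# json_str takes the JSON text directly; file would take a path or URL instead.
sample_list <- fromJSON(json_str = sample_json)

names(sample_list)         # one entry per book
sample_list$book_1$author  # both authors of the first book
sample_list$book_2$pages   # page count of the second book
```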
As seen previously, the information I have chosen to include about my books requires me to create multiple tibbles to present the data in a tidy way. However, it is possible to organize the information in a single data structure if the tidy philosophy is disregarded. For example:
json_dirty <- sapply(json_list, `[`)
knitr::kable(json_dirty)
| | book_1 | book_2 | book_3 |
|---|---|---|---|
| title | The Art of Data Science \| A Guide for Anyone Who Works with Data | Doing Bayesian Data Analysis \| A Tutorial with R and BUGS | R Programming for Data Science |
| author | c(“Roger D. Peng”, “Elizabeth Matsui”) | John K. Kruschke | Roger D. Peng |
| chapters | c(“Data Analysis as Art”, “Epicycles of Analysis”, “Stating and Refining the Question”, “Exploratory Data Analysis”, “Using Models to Explore Your Data”, “Inference: A Primer”, “Formal Modeling”, “Inference vs. Prediction: Implications for Modeling Strategy”, “Interpreting Your Results”, “Communication”, “Concluding Thoughts”) | c(“This Book’s Organization”, “Introduction”, “What is This Stuff Called Probability”, “Bayes’ Rule”, “Inferring a Binomial Proportion via Exact Mathematical Analysis”, “Inferring a Binomial Proportion via Grid Approximation”, “Inferring a Binomial Proportion via the Metropolis Algorithm”, “Inferring Two Binomial Proportions via Gibbs Sampling”, “Bernoulli Likelihood with Hierarchical Prior”, “Hierarchical Modeling and Model Comparison”, “Null Hypothesis Significance Testing”, “Bayesian Approaches to Testing a Point (Null) Hypothesis”, “Goals, Power, and Sample Size”, “Overview of the Generalized Linear Model”, “Metric Predicted Variable on a Single Group”, “Metric Predicted Variable with One Metric Predictor”, “Metric Predicted Variable with Multiple Metric Predictors”, “Metric Predicted Variable with One Nominal Predictor”, “Metric Predicted Variable with Multiple Nominal Predictors”, “Dichotomous Predicted Variable”, “Ordinal Predicted Variable”, “Contingency Table Analysis”, “Tools in the Trunk”) | c(“History and Overview of R”, “Getting Started with R”, “R Nuts and Bolts”, “Getting Data In and Out of R”, “Using the readr Package”, “Using Textual and Binary Formats for Storing Data”, “Interfaces to the Outside World”, “Subsetting R Objects”, “Vectorized Operations”, “Dates and Times”, “Managing Data Frames with the dplyr package”, “Control Structures”, “Functions”, “Scoping Rules of R”, “Coding Standards for R”, “Loop Functions”, “Regular Expressions”, “Debugging”, “Profiling R Code”, “Simulation”, “Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.”, “Parallel Computation”, “Why I Indent My Code 8 Spaces”) |
| pages | 155 | 829 | 176 |
| format | pdf | epub | pdf |
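This is an extremely ugly way to organize and display the information, and it breaks the one-value-per-cell rule because the author and chapters cells hold entire vectors. So, I will now tidy the data step by step as I did in the XML section.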
# Getting all the titles in one character vector
json_title_atomic <- character()
for (i in c(1, 6, 11)) {
  json_title_atomic <- append(x = json_title_atomic, values = json_dirty[[i]])
}
# Getting all the authors in one character vector
json_author_atomic <- character()
for (i in c(2, 7, 12)) {
  json_author_atomic <-
    append(x = json_author_atomic, values = json_dirty[[i]])
}
# Swapping the 2nd and 4th authors so that both Roger D. Peng entries come first
json_author_atomic[c(2, 4)] <- json_author_atomic[c(4, 2)]
# Extracting the chapters of each book separately, then matching them with their corresponding title
json_book_1_chapters <- tibble(title = json_title_atomic[1],
                               chapter = json_dirty[[3]])
json_book_2_chapters <- tibble(title = json_title_atomic[2],
                               chapter = json_dirty[[8]])
json_book_3_chapters <- tibble(title = json_title_atomic[3],
                               chapter = json_dirty[[13]])
# Getting all the page counts in one numeric vector
json_pages_atomic <- numeric()
for (i in c(4, 9, 14)) {
  json_pages_atomic <-
    append(x = json_pages_atomic, values = json_dirty[[i]])
}
# Getting all the publication formats in one character vector
json_format_atomic <- character()
for (i in c(5, 10, 15)) {
  json_format_atomic <-
    append(x = json_format_atomic, values = json_dirty[[i]])
}
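The index sets in the loops above (1, 6, 11 for titles, and so on) work because `json_dirty` is a matrix whose cells are list elements, and a single index into a matrix counts down each column first. With five rows per book, the title of book *k* sits at position 5(*k* − 1) + 1. The sketch below shows the same behavior in base R on a made-up three-field list, where the titles therefore land at 1 and 4. All `sample_*` names are illustrative.

```r
# Hypothetical stand-in for the parsed JSON: a named list of books.
sample_list <- list(
  book_1 = list(title = 'Book A', author = c('Author One', 'Author Two'), pages = 120),
  book_2 = list(title = 'Book B', author = 'Author One', pages = 80)
)

# Same call as for json_dirty: the result is a 3 x 2 matrix of list elements.
sample_dirty <- sapply(sample_list, `[`)
dim(sample_dirty)

# Linear indices run down each column first, so with 3 rows the titles sit at 1 and 4.
sample_dirty[[1]]
sample_dirty[[4]]

# Row and column names address the same cell without any index arithmetic.
sample_dirty[['title', 'book_2']]

# Alternatively, pull one field per book straight from the nested list.
sample_titles <- vapply(sample_list, function(book) book$title, character(1))
sample_titles
```

Name-based access or a direct `vapply()` over the nested list avoids having to recompute the positions if a field is added to each book.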
With all the necessary pieces parsed, it’s time to recombine them to create nice and tidy tibbles.
json_books <- tibble(
  title = json_title_atomic,
  authors = c(2, 1, 1),
  chapters = c(
    length(json_dirty[[3]]),
    length(json_dirty[[8]]),
    length(json_dirty[[13]])
  ),
  pages = json_pages_atomic,
  format = json_format_atomic
)
json_authors <- tibble(
  author = json_author_atomic,
  title = c(
    json_title_atomic[1],
    json_title_atomic[3],
    json_title_atomic[2],
    json_title_atomic[1]
  )
)
json_chapters <- bind_rows(
  json_book_1_chapters,
  json_book_2_chapters,
  json_book_3_chapters
)
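The author and title pairing above depends on a manual swap and a hand-ordered title vector. A possible alternative, sketched here on a made-up nested list standing in for the parsed JSON, is to build one small tibble per book and stack them, so each author stays attached to the title it came from. The `sample_*` names are illustrative.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for the parsed JSON: a named list of books.
sample_list <- list(
  book_1 = list(title = 'Book A', author = c('Author One', 'Author Two')),
  book_2 = list(title = 'Book B', author = 'Author One')
)

# One tibble per book; the length-1 title is recycled across that book's authors.
sample_authors <- bind_rows(lapply(sample_list, function(book) {
  tibble(author = book$author, title = book$title)
}))
sample_authors
```

The rows come out in book order rather than grouped by author, which differs from the ordering chosen above but carries the same pairs.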
json_books
## # A tibble: 3 × 5
## title authors chapt…¹ pages format
## <chr> <dbl> <int> <dbl> <chr>
## 1 The Art of Data Science | A Guide for Anyone Who… 2 11 155 pdf
## 2 Doing Bayesian Data Analysis | A Tutorial with R… 1 23 829 epub
## 3 R Programming for Data Science 1 23 176 pdf
## # … with abbreviated variable name ¹chapters
json_authors
## # A tibble: 4 × 2
## author title
## <chr> <chr>
## 1 Roger D. Peng The Art of Data Science | A Guide for Anyone Who Works with …
## 2 Roger D. Peng R Programming for Data Science
## 3 John K. Kruschke Doing Bayesian Data Analysis | A Tutorial with R and BUGS
## 4 Elizabeth Matsui The Art of Data Science | A Guide for Anyone Who Works with …
glimpse(json_chapters)
## Rows: 57
## Columns: 2
## $ title <chr> "The Art of Data Science | A Guide for Anyone Who Works with D…
## $ chapter <chr> "Data Analysis as Art", "Epicycles of Analysis", "Stating and …
# Showing the first and last 5 rows of json_chapters
json_chapters %>%
  slice(c(head(row_number(), 5), tail(row_number(), 5)))
## # A tibble: 10 × 2
## title chapter
## <chr> <chr>
## 1 The Art of Data Science | A Guide for Anyone Who Works with Data Data Analys…
## 2 The Art of Data Science | A Guide for Anyone Who Works with Data Epicycles o…
## 3 The Art of Data Science | A Guide for Anyone Who Works with Data Stating and…
## 4 The Art of Data Science | A Guide for Anyone Who Works with Data Exploratory…
## 5 The Art of Data Science | A Guide for Anyone Who Works with Data Using Model…
## 6 R Programming for Data Science Profiling R…
## 7 R Programming for Data Science Simulation
## 8 R Programming for Data Science Data Analys…
## 9 R Programming for Data Science Parallel Co…
## 10 R Programming for Data Science Why I Inden…
Reading HTML data into R and creating tidy data structures is far
easier than doing so with either JSON or XML. The htmltab package
reads HTML tables directly into data frames, and I plan to use it,
even though doing so almost feels like cheating because it’s so easy.
html_file_url <- 'https://raw.githubusercontent.com/pmahdi/cuny-data-607/main/assignment-5-books.html'
html_books <- as_tibble(htmltab(doc = html_file_url, which = 1))
html_authors <- as_tibble(htmltab(doc = html_file_url, which = 2))
html_chapters <- as_tibble(htmltab(doc = html_file_url, which = 3))
# Adding some extra information to html_books and shuffling the column order of html_authors
html_books <- html_books %>%
  mutate(authors = c(2, 1, 1), chapters = c(11, 23, 23), .after = title)
html_authors[c(1, 2)] <- html_authors[c(2, 1)]
colnames(html_authors) <- c('author', 'title')
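The author and chapter counts added to html_books above are entered by hand. Since the authors and chapters tables already contain one row per author and per chapter, the counts could instead be computed and joined on. This is a minimal sketch using small made-up tibbles in place of the htmltab results. It also converts `pages` to numeric, on the assumption that the scraped column arrives as character.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-ins for the three tables read from the HTML file.
sample_books <- tibble(
  title  = c('Book A', 'Book B'),
  pages  = c('120', '80'),
  format = c('pdf', 'epub')
)
sample_authors <- tibble(
  author = c('Author One', 'Author Two', 'Author One'),
  title  = c('Book A', 'Book A', 'Book B')
)
sample_chapters <- tibble(
  title   = c('Book A', 'Book A', 'Book B'),
  chapter = c('Intro', 'Methods', 'Only Chapter')
)

sample_books_full <- sample_books %>%
  # count() gives one row per title with the number of matching rows.
  left_join(count(sample_authors, title, name = 'authors'), by = 'title') %>%
  left_join(count(sample_chapters, title, name = 'chapters'), by = 'title') %>%
  relocate(authors, chapters, .after = title) %>%
  mutate(pages = as.numeric(pages))
sample_books_full
```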
html_books
## # A tibble: 3 × 5
## title authors chapt…¹ pages format
## <chr> <dbl> <dbl> <chr> <chr>
## 1 The Art of Data Science | A Guide for Anyone Who… 2 11 155 pdf
## 2 Doing Bayesian Data Analysis | A Tutorial with R… 1 23 829 epub
## 3 R Programming for Data Science 1 23 176 pdf
## # … with abbreviated variable name ¹chapters
html_authors
## # A tibble: 4 × 2
## author title
## <chr> <chr>
## 1 Roger D. Peng The Art of Data Science | A Guide for Anyone Who Works with …
## 2 Elizabeth Matsui The Art of Data Science | A Guide for Anyone Who Works with …
## 3 John K. Kruschke Doing Bayesian Data Analysis | A Tutorial with R and BUGS
## 4 Roger D. Peng R Programming for Data Science
glimpse(html_chapters)
## Rows: 57
## Columns: 2
## $ title <chr> "The Art of Data Science | A Guide for Anyone Who Works with D…
## $ chapter <chr> "Data Analysis as Art", "Epicycles of Analysis", "Stating and …
# Showing the first and last 5 rows of html_chapters
html_chapters %>%
  slice(c(head(row_number(), 5), tail(row_number(), 5)))
## # A tibble: 10 × 2
## title chapter
## <chr> <chr>
## 1 The Art of Data Science | A Guide for Anyone Who Works with Data Data Analys…
## 2 The Art of Data Science | A Guide for Anyone Who Works with Data Epicycles o…
## 3 The Art of Data Science | A Guide for Anyone Who Works with Data Stating and…
## 4 The Art of Data Science | A Guide for Anyone Who Works with Data Exploratory…
## 5 The Art of Data Science | A Guide for Anyone Who Works with Data Using Model…
## 6 R Programming for Data Science Profiling R…
## 7 R Programming for Data Science Simulation
## 8 R Programming for Data Science Data Analys…
## 9 R Programming for Data Science Parallel Co…
## 10 R Programming for Data Science Why I Inden…