Working with HTML, XML, and JSON in R.
Books of Choice
For this assignment, I selected three books that were part of my
“recent reads” from the summer: “Getting to Yes” by Roger Fisher,
William L. Ury, and Bruce Patton; “Diary of a CEO” by Steven Bartlett;
and “Psychology of Money” by Morgan Housel. I chose these books because
they became a significant part of my post-graduation journey. After
completing my B.S. degree and facing setbacks in my original career
plans, I found myself at a crossroads, contemplating whether to pursue a
degree in Data Science.
With no academic assignments to occupy my mind, I craved constant
mental stimulation. Activities that challenged me and allowed for
continuous learning became essential, and reading books that promised
skill development seemed like a natural choice. Throughout my college
career, I explored diverse subjects such as finance,
business/marketing, and medicine, and I am now exploring data
analytics.
These three books proved to be invaluable resources, offering
insights into various aspects of life, particularly mental health and
the profound influence of psychology in our daily experiences. They
provided me with a deeper understanding of the role psychology plays
beyond moments of anxiety or depression. Each book enriched my
perspective and equipped me with valuable skills that I believe will be
instrumental in shaping my future endeavors.
Dataframe 1: HTML
library(rvest)
library(xml2)
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:jsonlite':
##
## flatten
html_content <- '
<table>
<tr>
<th>Title</th>
<th>Author(s)</th>
<th>Notes</th>
</tr>
<tr>
<td>Getting to Yes</td>
<td>Roger Fisher, William L. Ury, Bruce Patton</td>
<td>Negotiation principles and strategies.</td>
</tr>
<tr>
<td>Diary of a CEO</td>
<td>Steven Bartlett</td>
<td>Insights into entrepreneurship and leadership.</td>
</tr>
<tr>
<td>Psychology of Money</td>
<td>Morgan Housel</td>
<td>Behavioral economics and personal finance.</td>
</tr>
</table>
'
writeLines(html_content, "books.html")
Dataframe 2: XML
xml_content <- '
<books>
<book>
<title>Getting to Yes</title>
<author>Roger Fisher, William L. Ury, Bruce Patton</author>
<notes>Negotiation principles and strategies.</notes>
</book>
<book>
<title>Diary of a CEO</title>
<author>Steven Bartlett</author>
<notes>Insights into entrepreneurship and leadership.</notes>
</book>
<book>
<title>Psychology of Money</title>
<author>Morgan Housel</author>
<notes>Behavioral economics and personal finance.</notes>
</book>
</books>
'
writeLines(xml_content, "books.xml")
Dataframe 3: JSON
json_content <- '[
{
"title": "Getting to Yes",
"author": "Roger Fisher, William L. Ury, Bruce Patton",
"notes": "Negotiation principles and strategies."
},
{
"title": "Diary of a CEO",
"author": "Steven Bartlett",
"notes": "Insights into entrepreneurship and leadership."
},
{
"title": "Psychology of Money",
"author": "Morgan Housel",
"notes": "Behavioral economics and personal finance."
}
]'
writeLines(json_content, "books.json")
Loading the Content & Parsing the Data from All Three Dataframes
HTML
html_data <- read_html("books.html")
html_df <- html_data %>%
html_nodes("table") %>%
html_table() %>%
.[[1]]
Creating the Column Names in the HTML Dataframe.
colnames(html_df) <- c("title", "author", "notes")
html_df <- data.frame(lapply(html_df, as.character), stringsAsFactors = FALSE)
html_df <- html_df[order(html_df$title), ]
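The HTML steps above can also be sketched with `html_element()`, which returns a single node, so no `.[[1]]` indexing is needed. This is only an illustration under assumptions: it needs rvest 1.0 or later, and the inline string `html_string` and the names `page` and `books_tbl` are made up for the example rather than taken from the code above.

```r
library(rvest)

# Hypothetical inline sample with two of the three books
html_string <- "<table>
  <tr><th>Title</th><th>Author(s)</th></tr>
  <tr><td>Getting to Yes</td><td>Roger Fisher, William L. Ury, Bruce Patton</td></tr>
  <tr><td>Diary of a CEO</td><td>Steven Bartlett</td></tr>
</table>"

page <- read_html(html_string)

# html_element() picks the first matching <table>; html_table() turns it into a tibble
books_tbl <- html_table(html_element(page, "table"))

# Convert to a plain data frame and use the same lowercase column names as the other formats
books_tbl <- as.data.frame(books_tbl, stringsAsFactors = FALSE)
names(books_tbl) <- c("title", "author")

str(books_tbl)
```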
Loading and Parsing XML.
xml_data <- read_xml("books.xml")
xml_df <- xml_data %>%
xml_find_all("//book") %>%
map_df(~data.frame(
title = xml_text(xml_find_first(.x, "./title")),
author = xml_text(xml_find_first(.x, "./author")),
notes = xml_text(xml_find_first(.x, "./notes"))
))
print(xml_df)
## title author
## 1 Getting to Yes Roger Fisher, William L. Ury, Bruce Patton
## 2 Diary of a CEO Steven Bartlett
## 3 Psychology of Money Morgan Housel
## notes
## 1 Negotiation principles and strategies.
## 2 Insights into entrepreneurship and leadership.
## 3 Behavioral economics and personal finance.
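As an alternative to building one small data frame per book with `map_df()`, the same table can be assembled column by column, because `xml_find_first()` accepts a whole node set and returns one result per node. The sketch below is a hedged illustration, not the method used above; `xml_string`, `doc`, `books`, and `books_df` are names invented for the example.

```r
library(xml2)

# Hypothetical inline sample with two of the three books
xml_string <- "<books>
  <book>
    <title>Getting to Yes</title>
    <author>Roger Fisher, William L. Ury, Bruce Patton</author>
  </book>
  <book>
    <title>Diary of a CEO</title>
    <author>Steven Bartlett</author>
  </book>
</books>"

doc <- read_xml(xml_string)
books <- xml_find_all(doc, "//book")

# One xml_find_first() call per column; each returns one node per <book>
books_df <- data.frame(
  title  = xml_text(xml_find_first(books, "./title")),
  author = xml_text(xml_find_first(books, "./author")),
  stringsAsFactors = FALSE
)

print(books_df)
```

One possible advantage of this form is that it depends only on xml2 and base R.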
Loading JSON
json_df <- fromJSON("books.json")
print(json_df)
## title author
## 1 Getting to Yes Roger Fisher, William L. Ury, Bruce Patton
## 2 Diary of a CEO Steven Bartlett
## 3 Psychology of Money Morgan Housel
## notes
## 1 Negotiation principles and strategies.
## 2 Insights into entrepreneurship and leadership.
## 3 Behavioral economics and personal finance.
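`fromJSON()` returns a data frame here because jsonlite simplifies a JSON array of objects with matching keys into rows and columns. The short sketch below contrasts that default with `simplifyVector = FALSE`, which keeps the nested list structure. The inline string `json_string` and the names `as_df` and `as_list` are assumptions made for the illustration.

```r
library(jsonlite)

# Hypothetical inline sample with two of the three books
json_string <- '[
  {"title": "Getting to Yes", "author": "Roger Fisher, William L. Ury, Bruce Patton"},
  {"title": "Diary of a CEO", "author": "Steven Bartlett"}
]'

# Default: the array of objects is simplified to a data frame
as_df <- fromJSON(json_string)

# Without simplification: a list with one named list per book
as_list <- fromJSON(json_string, simplifyVector = FALSE)

class(as_df)
length(as_list)
as_list[[2]]$author
```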
Compare Data Frames
identical(html_df, xml_df) && identical(html_df, json_df)
## [1] FALSE
str(html_df)
## 'data.frame': 3 obs. of 3 variables:
## $ title : chr "Diary of a CEO" "Getting to Yes" "Psychology of Money"
## $ author: chr "Steven Bartlett" "Roger Fisher, William L. Ury, Bruce Patton" "Morgan Housel"
## $ notes : chr "Insights into entrepreneurship and leadership." "Negotiation principles and strategies." "Behavioral economics and personal finance."
str(xml_df)
## 'data.frame': 3 obs. of 3 variables:
## $ title : chr "Getting to Yes" "Diary of a CEO" "Psychology of Money"
## $ author: chr "Roger Fisher, William L. Ury, Bruce Patton" "Steven Bartlett" "Morgan Housel"
## $ notes : chr "Negotiation principles and strategies." "Insights into entrepreneurship and leadership." "Behavioral economics and personal finance."
str(json_df)
## 'data.frame': 3 obs. of 3 variables:
## $ title : chr "Getting to Yes" "Diary of a CEO" "Psychology of Money"
## $ author: chr "Roger Fisher, William L. Ury, Bruce Patton" "Steven Bartlett" "Morgan Housel"
## $ notes : chr "Negotiation principles and strategies." "Insights into entrepreneurship and leadership." "Behavioral economics and personal finance."
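The `str()` output shows the same values in a different row order, which is enough for `identical()` to return `FALSE`. When that happens, `all.equal()` can help locate the difference, because it returns a description of the mismatches instead of a bare `FALSE`. The sketch below uses base R only; the toy data frames `a` and `b` are invented for the illustration.

```r
a <- data.frame(
  title  = c("Diary of a CEO", "Getting to Yes"),
  author = c("Steven Bartlett", "Roger Fisher, William L. Ury, Bruce Patton"),
  stringsAsFactors = FALSE
)

# Same rows in reverse order; the row names are carried along and become 2, 1
b <- a[c(2, 1), ]

# identical() only says that the two objects differ
identical(a, b)

# all.equal() returns TRUE or a character vector describing the differences,
# so wrap it in isTRUE() when a single logical value is needed
diff_report <- all.equal(a, b)
is_same <- isTRUE(diff_report)

print(diff_report)
rownames(b)
```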
# Coerce every column to character so the column types match
html_df <- data.frame(lapply(html_df, as.character), stringsAsFactors = FALSE)
xml_df <- data.frame(lapply(xml_df, as.character), stringsAsFactors = FALSE)
json_df <- data.frame(lapply(json_df, as.character), stringsAsFactors = FALSE)
# Strip leading and trailing whitespace from every value
html_df <- data.frame(lapply(html_df, trimws), stringsAsFactors = FALSE)
xml_df <- data.frame(lapply(xml_df, trimws), stringsAsFactors = FALSE)
json_df <- data.frame(lapply(json_df, trimws), stringsAsFactors = FALSE)
# Sort by title so the rows appear in the same order (row names are permuted too)
html_df <- html_df[order(html_df$title), ]
xml_df <- xml_df[order(xml_df$title), ]
json_df <- json_df[order(json_df$title), ]
html_df <- as.data.frame(html_df)
html_df <- html_df[order(html_df$title), ]
if (!is.data.frame(html_df)) {
print("html_df is not a data frame.")
} else {
print("html_df is a data frame.")
}
## [1] "html_df is a data frame."
if ("title" %in% colnames(html_df)) {
print("Column 'title' exists in html_df.")
} else {
print("Column 'title' does not exist in html_df.")
}
## [1] "Column 'title' exists in html_df."
str(html_df)
## 'data.frame': 3 obs. of 3 variables:
## $ title : chr "Diary of a CEO" "Getting to Yes" "Psychology of Money"
## $ author: chr "Steven Bartlett" "Roger Fisher, William L. Ury, Bruce Patton" "Morgan Housel"
## $ notes : chr "Insights into entrepreneurship and leadership." "Negotiation principles and strategies." "Behavioral economics and personal finance."
str(xml_df)
## 'data.frame': 3 obs. of 3 variables:
## $ title : chr "Diary of a CEO" "Getting to Yes" "Psychology of Money"
## $ author: chr "Steven Bartlett" "Roger Fisher, William L. Ury, Bruce Patton" "Morgan Housel"
## $ notes : chr "Insights into entrepreneurship and leadership." "Negotiation principles and strategies." "Behavioral economics and personal finance."
str(json_df)
## 'data.frame': 3 obs. of 3 variables:
## $ title : chr "Diary of a CEO" "Getting to Yes" "Psychology of Money"
## $ author: chr "Steven Bartlett" "Roger Fisher, William L. Ury, Bruce Patton" "Morgan Housel"
## $ notes : chr "Insights into entrepreneurship and leadership." "Negotiation principles and strategies." "Behavioral economics and personal finance."
# Second pass: rebuilding each frame with data.frame() resets the row names
# that the earlier sort had permuted, so all three frames match exactly
html_df <- data.frame(lapply(html_df, as.character), stringsAsFactors = FALSE)
xml_df <- data.frame(lapply(xml_df, as.character), stringsAsFactors = FALSE)
json_df <- data.frame(lapply(json_df, as.character), stringsAsFactors = FALSE)
html_df <- data.frame(lapply(html_df, trimws), stringsAsFactors = FALSE)
xml_df <- data.frame(lapply(xml_df, trimws), stringsAsFactors = FALSE)
json_df <- data.frame(lapply(json_df, trimws), stringsAsFactors = FALSE)
# The rows are already sorted, so ordering again leaves them unchanged
html_df <- html_df[order(html_df$title), ]
xml_df <- xml_df[order(xml_df$title), ]
json_df <- json_df[order(json_df$title), ]
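The normalization steps above (coerce to character, trim whitespace, sort by title) could be collected into one helper that also resets the row names, so a single pass per data frame is enough. This is a minimal sketch under assumptions: `normalize_df()` and its `key` argument are hypothetical names, not part of any package, and the toy data frames `x` and `y` stand in for the real ones.

```r
# Hypothetical helper: coerce to character, trim, sort by a key column, reset row names
normalize_df <- function(df, key = "title") {
  df <- as.data.frame(
    lapply(df, function(col) trimws(as.character(col))),
    stringsAsFactors = FALSE
  )
  df <- df[order(df[[key]]), , drop = FALSE]
  rownames(df) <- NULL  # sorting permutes the row names; reset them to 1..n
  df
}

# Toy data: same content, different row order and stray whitespace
x <- data.frame(
  title  = c("Getting to Yes", "Diary of a CEO"),
  author = c("Roger Fisher, William L. Ury, Bruce Patton", "Steven Bartlett"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  title  = c(" Diary of a CEO", "Getting to Yes "),
  author = c("Steven Bartlett", "Roger Fisher, William L. Ury, Bruce Patton"),
  stringsAsFactors = FALSE
)

same_before <- identical(x, y)
same_after  <- identical(normalize_df(x), normalize_df(y))

same_before
same_after
```

Resetting the row names matters because `identical()` compares attributes as well as values, and two data frames with the same rows but different row names are not identical.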
Confirmation: Are All Three Dataframes Identical?
identical(html_df, xml_df)
## [1] TRUE
identical(html_df, json_df)
## [1] TRUE
identical(xml_df, json_df)
## [1] TRUE
all.equal(html_df, xml_df)
## [1] TRUE
all.equal(html_df, json_df)
## [1] TRUE
all.equal(xml_df, json_df)
## [1] TRUE
The Result:
All three of my data frames are identical once the column types,
surrounding whitespace, and row order have been normalized, which shows
that the information extracted from the HTML, XML, and JSON sources is
consistent for all three books. The first comparison returned FALSE,
but the str() output shows that html_df had already been sorted by
title while xml_df and json_df were still in file order; the values
themselves did not differ. This indicates that the parsing was accurate
for every format. For a small, flat table like this one, the choice of
format does not affect the information that is recovered, so the data
can be used interchangeably and manipulated in whichever format is
preferred.