Assignment 7

Daniel DeBonis

We will need functions from a variety of packages to import and convert these tables, so loading the relevant packages is the first step.

library(rvest)
library(rjson)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(xml2)
## Warning: package 'xml2' was built under R version 4.4.3

Importing from HTML

After writing the table in Notepad, I uploaded it to GitHub so it can be accessed anywhere.

html_df <- html_table(read_html("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/books.html"))
print(html_df)
## [[1]]
## # A tibble: 3 × 6
##   Title  `Author 1` `Author 2` `Year of Publication` Publisher `Number of Pages`
##   <chr>  <chr>      <chr>                      <int> <chr>                 <int>
## 1 Colle… Jorge Lui… "Andrew H…                  1998 Penguin                 575
## 2 How t… Dennard D… ""                          2025 Henry Ho…               334
## 3 Appli… Max Kuhn   "Kjell Jo…                  2016 Springer                618
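
One thing worth noting is that html_table() returns a list of tibbles, one per <table> element on the page. Since this page contains only one table, the tibble itself can be pulled out of the list; a minimal sketch (the name html_books is just for illustration):

# html_table() wraps its results in a list; element 1 is the only table here
html_books <- html_df[[1]]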

Importing from XML

When importing an XML file, several steps are necessary to convert the file to a data frame, even though I also started by writing the table in XML in Notepad. First the records need to be extracted, then each field pulled out as its own column.

xml_up <- read_xml("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/books.xml")
records <- xml_find_all(xml_up, "//book")
titles <- xml_text(xml_find_all(records, "title"))
authors_1 <- xml_text(xml_find_all(records, "author_1"))
authors_2 <- xml_text(xml_find_all(records, "author_2"))
years <- xml_text(xml_find_all(records, "year"))
publishers <- xml_text(xml_find_all(records, "publisher"))
pages <- xml_text(xml_find_all(records, "pages"))
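
A caveat worth flagging: xml_find_all() silently drops records in which the requested node is absent, which would misalign the columns. It works here because every <book> appears to contain an author_2 element, even when it is empty. A more defensive sketch would use xml_find_first(), which returns exactly one result per record and yields NA for a missing child:

# xml_find_first() gives one result per <book>, with NA where the child
# element does not exist, so the columns cannot fall out of alignment
authors_2 <- xml_text(xml_find_first(records, "author_2"))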

Once each column is extracted, they can be placed into a data frame.

xml_df <- data.frame(
  Title = titles,
  Author1 = authors_1,
  Author2 = authors_2,
  Year = years,
  Publisher = publishers,
  Pages = pages,
  stringsAsFactors = FALSE
)
print(xml_df)
##                         Title           Author1                    Author2 Year
## 1          Collected Fictions Jorge Luis Borges Andrew Hurley (translator) 1998
## 2   How to Dodge a Cannonball     Dennard Dayle                            2025
## 3 Applied Predictive Modeling          Max Kuhn              Kjell Johnson 2016
##                Publisher Pages
## 1                Penguin   575
## 2 Henry Holt and Company   334
## 3               Springer   618
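
Since xml_text() always returns character vectors, the year and pages columns come through as strings. If numeric types are wanted, the whole frame can be re-typed in one pass; a sketch using base R's type.convert():

# Re-infer each column's type; as.is = TRUE keeps strings as character
# rather than converting them to factors
xml_df <- type.convert(xml_df, as.is = TRUE)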

Importing from JSON

I also created this table in Notepad, following the format of other JSON tables. This format also requires a conversion to turn the parsed result into a data frame, but it is a much simpler process.

json_u <- fromJSON(file = "https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/books.json")
json_df <- as.data.frame(json_u)
print(json_df)
##                         Title           Author1                    Author2 Year
## 1          Collected Fictions Jorge Luis Borges Andrew Hurley (translator) 1998
## 2   How to Dodge a Cannonball     Dennard Dayle                            2025
## 3 Applied Predictive Modeling          Max Kuhn              Kjell Johnson 2016
##              Publisher Pages
## 1              Penguin   576
## 2 Henry Holt & Company   334
## 3             Springer   618
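
A side note on structure: as.data.frame() works in one step here because this JSON file evidently stores each column as an array of equal length. If the file were instead record-oriented (an array with one object per book), each parsed record would be a named list, and something like bind_rows() could stack them. A sketch, with a hypothetical record-oriented file name:

# Hypothetical file laid out as one JSON object per book
json_records <- fromJSON(file = "books_by_record.json")
json_df2 <- dplyr::bind_rows(json_records)   # stack the records row-wise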

Now that the table has been imported across the three formats, it is clear that the tables are mostly identical, but there are some subtle differences between them. One of the most important is how the numeric data was typed by each conversion. From HTML, the year and pages columns were classified as integers. From XML, the same columns were classified as character strings, like every other column in the data frame. From JSON, these columns were classified as doubles. There are also some differences in the column labels, reflecting the different names I gave them when making each table.

Another important distinction is how the missing value is treated. One book listed has only one author. The imports from XML and JSON account for this with an empty cell, but the HTML conversion shows two quotation marks in that cell, and in fact every value in that column is printed with quotation marks around it. No quotation marks were used in creating the table, as can be verified from the copy of the HTML file on GitHub. They appear because a tibble quotes an entire character column in its printout whenever one of its values needs disambiguation, such as the empty string, so the quotes are a display artifact rather than characters stored in the data.
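
These typing differences are easy to confirm directly; a quick check of the column classes across the three imports as originally loaded:

# Inspect how each parser typed the columns (html_df is a list of tables)
sapply(html_df[[1]], class)
sapply(xml_df, class)
sapply(json_df, class)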