For this assignment I prepared three separate files, in HTML, XML, and JSON formats, each containing the same information about three of my favorite books: title, authors, published year, and genre.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
library(xml2)
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
htmlurl <- "https://raw.githubusercontent.com/amedina613/Data607-Week-7-Assignment/main/books.html"
html_data <- read_html(htmlurl)
# The HTML file loaded as class "xml_document"
class(html_data)
## [1] "xml_document" "xml_node"
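# html_table() returns a list with one tibble per <table> in the page
html_table <- html_data %>%
  html_table(fill = TRUE)
# Convert the list of tables into a single base data frame
html_df <- as.data.frame(html_table)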
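The step above relies on html_table() returning a list of tables rather than a single data frame. Below is a minimal sketch of that behavior, using a small inline table as a hypothetical stand-in for books.html. The inline markup and the names `page`, `tables`, and `first_table` are assumptions for illustration, and the header text "Published Year" is a guess at how the original file spells it.

```r
library(rvest)

# Hypothetical inline table standing in for books.html
page <- read_html('<table>
  <tr><th>Title</th><th>Published Year</th></tr>
  <tr><td>Moby Dick</td><td>1851</td></tr>
</table>')

# html_table() on a whole document returns a list with one tibble per <table>
tables <- html_table(page)
first_table <- tables[[1]]

# The tibble keeps the header text as written
print(names(first_table))

# as.data.frame() on the list runs the names through make.names(),
# which replaces the space with a dot
print(names(as.data.frame(tables)))
```

If the source header does contain a space, this would explain why the HTML data frame ends up with a column called Published.Year.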
xmlurl <- "https://raw.githubusercontent.com/amedina613/Data607-Week-7-Assignment/main/books.xml"
xml_data <- read_xml(xmlurl)
# The XML file is loaded as class "xml_document"
class(xml_data)
## [1] "xml_document" "xml_node"
# Extract each field with an XPath query; every call returns one value per matching node
titles <- xml_text(xml_find_all(xml_data, ".//title"))
authors <- xml_text(xml_find_all(xml_data, ".//authors"))
published_years <- as.numeric(xml_text(xml_find_all(xml_data, ".//published_year")))
genres <- xml_text(xml_find_all(xml_data, ".//genre"))
# Assemble the extracted vectors into a data frame
xml_df <- data.frame(
  title = titles,
  authors = authors,
  published_year = published_years,
  genre = genres
)
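Extracting each field with a separate document-wide query, as above, assumes every book has every field; if one were missing, the vectors could end up with different lengths or fall out of alignment. One possible alternative is to query relative to each book node. The sketch assumes each record sits in a `<book>` element; the element name, the inline XML, and the helper `field()` are hypothetical, not taken from books.xml.

```r
library(xml2)

# Hypothetical inline XML; the second book deliberately lacks <published_year>
doc <- read_xml('<books>
  <book>
    <title>Moby Dick</title>
    <authors>Herman Melville</authors>
    <published_year>1851</published_year>
    <genre>Adventure Fiction</genre>
  </book>
  <book>
    <title>The House of the Scorpion</title>
    <authors>Nancy Farmer</authors>
    <genre>Science Fiction</genre>
  </book>
</books>')

books <- xml_find_all(doc, ".//book")

# xml_find_first() returns one result per book, so a missing field
# becomes NA instead of shortening the vector
field <- function(name) xml_text(xml_find_first(books, paste0("./", name)))

books_df <- data.frame(
  title = field("title"),
  authors = field("authors"),
  published_year = as.numeric(field("published_year")),
  genre = field("genre")
)
print(books_df)
```

For a complete file like the one used here, both approaches should give the same rows; the per-node version mainly guards against incomplete records.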
json_url <- "https://raw.githubusercontent.com/amedina613/Data607-Week-7-Assignment/main/books.json"
json_data <- fromJSON(json_url)
# The JSON file is loaded as class "data.frame"
class(json_data)
## [1] "data.frame"
# Effectively a no-op: fromJSON() already returned a data frame
json_df <- as.data.frame(json_data)
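fromJSON() returned a data frame here because, by default, it simplifies a JSON array of objects with matching keys into one. Here is a small sketch of that behavior, assuming books.json is such an array; the inline JSON and the names `simplified` and `nested` are hypothetical stand-ins.

```r
library(jsonlite)

# Hypothetical inline JSON standing in for books.json
json_text <- '[
  {"title": "Moby Dick", "authors": "Herman Melville",
   "published_year": 1851, "genre": "Adventure Fiction"},
  {"title": "The House of the Scorpion", "authors": "Nancy Farmer",
   "published_year": 2002, "genre": "Science Fiction"}
]'

# Default: the array of records is simplified into a data frame
simplified <- fromJSON(json_text)
print(class(simplified))

# With simplification turned off, the same input stays a nested list
nested <- fromJSON(json_text, simplifyVector = FALSE)
print(class(nested))
```

This is why the JSON file needed no separate extraction step, unlike the HTML and XML files.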
print(html_df)
##                                Title         Authors Published.Year
## 1          The House of the Scorpion    Nancy Farmer           2002
## 2 The Inheritance of Orquidea Divina Zoraida Cordova           2021
## 3                          Moby Dick Herman Melville           1851
##               Genre
## 1   Science Fiction
## 2   Fantasy Fiction
## 3 Adventure Fiction
print(xml_df)
##                                title         authors published_year
## 1          The House of the Scorpion    Nancy Farmer           2002
## 2 The Inheritance of Orquidea Divina Zoraida Cordova           2021
## 3                          Moby Dick Herman Melville           1851
##               genre
## 1   Science Fiction
## 2   Fantasy Fiction
## 3 Adventure Fiction
print(json_df)
##                                title         authors published_year
## 1          The House of the Scorpion    Nancy Farmer           2002
## 2 The Inheritance of Orquidea Divina Zoraida Cordova           2021
## 3                          Moby Dick Herman Melville           1851
##               genre
## 1   Science Fiction
## 2   Fantasy Fiction
## 3 Adventure Fiction
The column names differ between the data frames. The HTML data frame uses capitalized names and a dot separator (“Title”, “Authors”, “Published.Year”, “Genre”), while the XML and JSON data frames use lowercase names and an underscore (“published_year”). These differences are easily fixed by assigning the same names to all three data frames:
names(html_df) <- c("title", "authors", "published_year", "genre")
names(xml_df) <- c("title", "authors", "published_year", "genre")
names(json_df) <- c("title", "authors", "published_year", "genre")
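Once the names match, the data frames can be compared directly. The sketch below uses two small hand-built data frames as hypothetical stand-ins for html_df and xml_df; the names `from_html` and `from_xml` and the single-row contents are assumptions for illustration. It also shows that matching names does not guarantee matching column types.

```r
# Hypothetical stand-ins: the year is an integer in one and a double in the other
from_html <- data.frame(Title = "Moby Dick", Published.Year = 1851L)
from_xml  <- data.frame(title = "Moby Dick", published_year = 1851)

# Harmonize the names, as done for the three data frames above
names(from_html) <- names(from_xml)

# identical() is strict about types; all.equal() compares values with a tolerance
strict_match <- identical(from_html, from_xml)
value_match  <- isTRUE(all.equal(from_html, from_xml))

print(c(strict = strict_match, values = value_match))
```

In the data above, as.numeric() makes the XML year a double, so a strict identical() comparison against the other data frames may fail even when every value agrees.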
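Another difference I noticed was the class of the imported data from each file. The HTML file (parsed with rvest) loaded as class “xml_document” “xml_node”, so the table had to be extracted with html_table().
The XML file (parsed with xml2) also loaded as class “xml_document” “xml_node”, so each field had to be extracted with XPath queries.
The JSON file (parsed with jsonlite) loaded directly as class “data.frame”.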
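These class differences determine how much work each format needs before analysis. As a rough illustration, one could check whether a parsed object is still a document tree or already tabular. The sketch uses tiny inline documents in place of the three files; the inline strings and the helper `needs_extraction()` are hypothetical.

```r
library(xml2)
library(jsonlite)

# Hypothetical one-record inputs standing in for the three files
parsed <- list(
  html = read_html("<table><tr><th>title</th></tr><tr><td>Moby Dick</td></tr></table>"),
  xml  = read_xml("<books><book><title>Moby Dick</title></book></books>"),
  json = fromJSON('[{"title": "Moby Dick"}]')
)

# A parsed document tree still needs an extraction step; a data frame does not
needs_extraction <- function(x) inherits(x, "xml_document")

extraction_flags <- vapply(parsed, needs_extraction, logical(1))
print(extraction_flags)
```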