Introduction

For this assignment I have prepared three separate files in HTML, XML, and JSON formats, each containing the following information about my favorite books:

Title
Author
Published Year
Genre

Each file was uploaded to GitHub and then read into R from its raw URL.

Load necessary libraries:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(jsonlite)

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten

library(xml2)
library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

Load HTML file into a data frame

htmlurl <- "https://raw.githubusercontent.com/amedina613/Data607-Week-7-Assignment/main/books.html"
html_data <- read_html(htmlurl)

# The HTML file loaded as class "xml_document"
class(html_data)

## [1] "xml_document" "xml_node"

Extract HTML table and convert it to a data frame:

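# html_table() returns a list with one tibble per <table> on the page
html_tables <- html_data %>%
  html_table()

# Converting the list makes column names syntactic (spaces become dots)
html_df <- as.data.frame(html_tables)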
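The extraction step above can be tried without a network connection. The following is a minimal sketch, assuming a small inline table as a stand-in for books.html (the markup and its two columns are hypothetical, not the contents of the real file). It shows that html_table() returns a list of tables, and that converting the whole list with as.data.frame() changes a header containing a space.

```r
library(rvest)

# Hypothetical inline page standing in for books.html
page <- minimal_html('
  <table>
    <tr><th>Title</th><th>Published Year</th></tr>
    <tr><td>Moby Dick</td><td>1851</td></tr>
    <tr><td>The House of the Scorpion</td><td>2002</td></tr>
  </table>')

tables <- html_table(page)      # a list with one tibble per <table>
first_table <- tables[[1]]      # the tibble keeps the header text as written
as_df <- as.data.frame(tables)  # data.frame() makes the names syntactic

names(first_table)
names(as_df)
```

Comparing the two sets of names shows where a dotted column name such as Published.Year comes from.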

Load XML file into a data frame

xmlurl <- "https://raw.githubusercontent.com/amedina613/Data607-Week-7-Assignment/main/books.xml"

xml_data <- read_xml(xmlurl)
# The XML file is loaded as class "xml_document"
class(xml_data)

## [1] "xml_document" "xml_node"

Extract the information from the XML:

titles <- xml_text(xml_find_all(xml_data, ".//title"))
authors <- xml_text(xml_find_all(xml_data, ".//authors"))
published_years <- as.numeric(xml_text(xml_find_all(xml_data, ".//published_year")))
genres <- xml_text(xml_find_all(xml_data, ".//genre"))

Create a data frame for the information:

xml_df <- data.frame(
  title = titles,
  authors = authors,
  published_year = published_years,
  genre = genres
)
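Collecting each field with a document-wide xml_find_all() works when every book has every element. If one were missing, the vectors would have different lengths and data.frame() would fail. The sketch below shows one possible alternative: select each record node first, then use xml_find_first() inside it. It assumes the records are wrapped in a `<book>` element, which is a guess about the layout of books.xml, and it uses a hypothetical inline document.

```r
library(xml2)

# Hypothetical inline document; the <book> wrapper is an assumption
doc <- read_xml('<books>
  <book><title>Moby Dick</title><published_year>1851</published_year></book>
  <book><title>Untitled Draft</title></book>
</books>')

books <- xml_find_all(doc, ".//book")

# xml_find_first() returns one result per book, so rows stay aligned;
# a missing element becomes NA instead of shortening the vector
books_df <- data.frame(
  title = xml_text(xml_find_first(books, "./title")),
  published_year = as.numeric(xml_text(xml_find_first(books, "./published_year")))
)

books_df
```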

Load JSON file into a data frame

Read JSON data from URL:

json_url <- "https://raw.githubusercontent.com/amedina613/Data607-Week-7-Assignment/main/books.json"

json_data <- fromJSON(json_url)
# fromJSON() simplifies the JSON array of records into a data frame
class(json_data)

## [1] "data.frame"

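Convert to data frame (this changes nothing, since fromJSON() already returned a data frame; it is kept for symmetry with the other two formats):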

json_df <- as.data.frame(json_data)
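The JSON file needs no extraction step because jsonlite simplifies an array of records into a data frame by default. A minimal sketch, using a hypothetical inline JSON string in place of books.json, compares that default with simplifyVector = FALSE, which returns a plain nested list.

```r
library(jsonlite)

# Hypothetical inline JSON standing in for books.json
json_text <- '[
  {"title": "Moby Dick", "published_year": 1851},
  {"title": "The House of the Scorpion", "published_year": 2002}
]'

books <- fromJSON(json_text)                            # simplified to a data frame
books_raw <- fromJSON(json_text, simplifyVector = FALSE) # nested list, one element per record

class(books)
class(books_raw)
```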

View the data frames:

print(html_df)

##                                Title         Authors Published.Year
## 1          The House of the Scorpion    Nancy Farmer           2002
## 2 The Inheritance of Orquidea Divina Zoraida Cordova           2021
## 3                          Moby Dick Herman Melville           1851
##               Genre
## 1   Science Fiction
## 2   Fantasy Fiction
## 3 Adventure Fiction

print(xml_df)

##                                title         authors published_year
## 1          The House of the Scorpion    Nancy Farmer           2002
## 2 The Inheritance of Orquidea Divina Zoraida Cordova           2021
## 3                          Moby Dick Herman Melville           1851
##               genre
## 1   Science Fiction
## 2   Fantasy Fiction
## 3 Adventure Fiction

print(json_df)

##                                title         authors published_year
## 1          The House of the Scorpion    Nancy Farmer           2002
## 2 The Inheritance of Orquidea Divina Zoraida Cordova           2021
## 3                          Moby Dick Herman Melville           1851
##               genre
## 1   Science Fiction
## 2   Fantasy Fiction
## 3 Adventure Fiction

Conclusion

The three data frames hold the same data but differ in column naming. The HTML data frame's columns are capitalized (Title, Authors, Published.Year, Genre), with a dot between the words of the published year column, while the XML and JSON data frames use lowercase names such as published_year. These differences are easy to fix by assigning the same names to all three.

Standardize column names:

names(html_df) <- c("title", "authors", "published_year", "genre")
names(xml_df) <- c("title", "authors", "published_year", "genre")
names(json_df) <- c("title", "authors", "published_year", "genre")
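Assigning names by position works here because all three data frames have their columns in the same order. As a sketch of a less position-dependent option, a small helper (hypothetical, not part of the original analysis) can derive the names by rule instead: lowercase everything and replace runs of non-alphanumeric characters with an underscore.

```r
# Hypothetical helper: lowercase names and replace separators with "_"
standardize_names <- function(df) {
  nm <- tolower(names(df))
  names(df) <- gsub("[^a-z0-9]+", "_", nm)
  df
}

# Toy one-row data frames mimicking the HTML and XML column names
html_like <- data.frame(Title = "Moby Dick", Published.Year = 1851)
xml_like <- data.frame(title = "Moby Dick", published_year = 1851)

cleaned <- standardize_names(html_like)
names(cleaned)
all.equal(cleaned, xml_like)
```

Once the names agree, all.equal() gives a quick check that two imports contain the same data.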

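Another difference is the class of the object each parser returns:

HTML file (parsed with rvest): "xml_document"

XML file (parsed with xml2): "xml_document"

JSON file (parsed with jsonlite): "data.frame"

As a result, the HTML and XML data needed an explicit extraction step before they could be used as data frames, while the JSON data did not.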

Data 607 - Week 7 Assignment

Adriana Medina

2024-03-10
