Working with XML and JSON in R

Objective

The goal of this assignment was to work with three data formats in R: HTML, XML, and JSON. We selected three books, stored their attributes in each of the three formats, and then wrote R code to load each file and convert it into a data frame.

Selected Data

We chose three books on programming and software engineering. The books and their attributes are:

The Pragmatic Programmer
- Authors: Andrew Hunt, David Thomas
- Published Year: 1999
- Genre: Software Engineering

Clean Code: A Handbook of Agile Software Craftsmanship
- Author: Robert C. Martin
- Published Year: 2008
- Genre: Software Engineering

Code: The Hidden Language of Computer Hardware and Software
- Author: Charles Petzold
- Published Year: 1999
- Genre: Computer Science
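
The data files themselves are not reproduced in this report. As a rough sketch, a JSON file of this kind might be produced as follows. The layout is an assumption: a top-level "books" array is suggested by the json_data$books access in the R code, and the field names are borrowed from the element names queried in the XML-loading code. The actual books.json may differ.

```r
library(jsonlite)

# Assumed layout: one row per book, with field names taken from the
# XML element names used later in this report (not from the real file).
books <- data.frame(
  title = c(
    "The Pragmatic Programmer",
    "Clean Code: A Handbook of Agile Software Craftsmanship",
    "Code: The Hidden Language of Computer Hardware and Software"
  ),
  authors = c("Andrew Hunt, David Thomas", "Robert C. Martin", "Charles Petzold"),
  published_year = c(1999L, 2008L, 1999L),
  genre = c("Software Engineering", "Software Engineering", "Computer Science"),
  stringsAsFactors = FALSE
)

# Wrap the data frame in a list so the JSON has a top-level "books" key;
# toJSON() writes a data frame as an array of objects, one per row.
json_text <- toJSON(list(books = books), pretty = TRUE)
cat(json_text)

# Reading the text back: fromJSON() simplifies the array of objects
# into a data frame again.
round_trip <- fromJSON(json_text)$books
```

Storing several authors as one comma-separated string keeps every record flat, which makes the conversion to a data frame straightforward. A nested array of authors would be closer to idiomatic JSON but would produce a list column.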

R Code

# Load necessary libraries
library(jsonlite)
library(xml2)

## Warning: package 'xml2' was built under R version 4.2.3

library(rvest)

## Warning: package 'rvest' was built under R version 4.2.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(purrr)

## 
## Attaching package: 'purrr'

## The following object is masked from 'package:jsonlite':
## 
##     flatten

# Load JSON data
json_data <- fromJSON("books.json")
df_json <- as.data.frame(json_data$books)

# Load XML data
xml_data <- read_xml("books.xml")
xml_df <- xml_data %>% 
  xml_find_all("//book") %>% 
  map_df(~{
    tibble(
      title = xml_text(xml_find_first(.x, "title")),
      authors = xml_text(xml_find_first(.x, "authors")),
      published_year = xml_text(xml_find_first(.x, "published_year")),
      genre = xml_text(xml_find_first(.x, "genre"))
    )
  })

# Load HTML data
html_data <- read_html("books.html")
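# html_table() returns a list of tables; keep the first one.
# (The fill argument is deprecated in rvest 1.0.0 and later, so it is omitted.)
df_html <- html_data %>% html_table() %>% .[[1]]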

# Check whether the data frames are identical.
# identical() is strict: it also compares classes, column names, and column types.
identical(df_json, xml_df) # Compare JSON and XML data frames

## [1] FALSE

identical(df_json, df_html) # Compare JSON and HTML data frames

## [1] FALSE
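
Both comparisons return FALSE even though the three files describe the same books. The report does not investigate why, but the loading code suggests some likely causes: xml_text() always returns character values, so published_year is a character column in xml_df; map_df() and html_table() return tibbles, while df_json is a plain data frame; and the HTML column names come from the table header, which may not match the JSON field names. The sketch below shows one way to bring the results to a common shape before comparing them. The helper normalize_books() is hypothetical, and the stand-in data frames (including the assumed HTML header text) only imitate these differences. They are not the objects loaded above.

```r
# Hypothetical helper: coerce a books data frame to a common shape.
normalize_books <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)  # drop tibble classes
  # Lower-case the names and replace runs of other characters with "_",
  # e.g. "Published Year" becomes "published_year".
  names(df) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(df)))
  # Compare everything as trimmed text, then restore the year as integer.
  df[] <- lapply(df, function(col) trimws(as.character(col)))
  df$published_year <- as.integer(df$published_year)
  rownames(df) <- NULL
  df
}

# Stand-in data imitating the differences described above.
from_json <- data.frame(
  title = c("The Pragmatic Programmer",
            "Code: The Hidden Language of Computer Hardware and Software"),
  authors = c("Andrew Hunt, David Thomas", "Charles Petzold"),
  published_year = c(1999L, 1999L),
  genre = c("Software Engineering", "Computer Science"),
  stringsAsFactors = FALSE
)

from_xml <- from_json
from_xml$published_year <- as.character(from_xml$published_year)  # text, as xml_text() yields

from_html <- from_json
names(from_html) <- c("Title", "Authors", "Published Year", "Genre")  # assumed header text

raw_match  <- identical(from_json, from_xml)
xml_match  <- identical(normalize_books(from_json), normalize_books(from_xml))
html_match <- identical(normalize_books(from_json), normalize_books(from_html))
```

With these stand-ins, raw_match is FALSE because of the column type difference, while the two normalized comparisons are TRUE. For the real data, all.equal() is a useful alternative to identical(), since it reports which attributes or columns differ.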

Frederick Jones

2023-10-15
