Objective

The goal of this assignment was to work with three data formats in R: HTML, XML, and JSON. We selected three books, stored their attributes in each of the three formats, and then wrote R code to load each file and convert it into a data frame.

Selected Data

We chose three books on programming and software engineering. The books and their attributes are:

  1. The Pragmatic Programmer
    • Authors: Andrew Hunt, David Thomas
    • Published Year: 1999
    • Genre: Software Engineering
  2. Clean Code: A Handbook of Agile Software Craftsmanship
    • Author: Robert C. Martin
    • Published Year: 2008
    • Genre: Software Engineering
  3. Code: The Hidden Language of Computer Hardware and Software
    • Author: Charles Petzold
    • Published Year: 1999
    • Genre: Computer Science
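
The source files themselves are not shown in this report. As a rough sketch only, the JSON version could be produced as follows, assuming a top-level "books" key (suggested by the later `json_data$books` access) and the column names used in the XML loader. The temporary file path and the choice to store both authors in a single string are assumptions made for illustration.

```r
library(jsonlite)

# Hypothetical in-memory version of the three books listed above.
books <- data.frame(
  title = c(
    "The Pragmatic Programmer",
    "Clean Code: A Handbook of Agile Software Craftsmanship",
    "Code: The Hidden Language of Computer Hardware and Software"
  ),
  # Assumption: multiple authors are kept as one comma-separated string.
  authors = c("Andrew Hunt, David Thomas", "Robert C. Martin", "Charles Petzold"),
  published_year = c(1999L, 2008L, 1999L),
  genre = c("Software Engineering", "Software Engineering", "Computer Science"),
  stringsAsFactors = FALSE
)

# Wrap the records in a top-level "books" key (assumed layout) and serialize.
json_text <- toJSON(list(books = books), pretty = TRUE)

# Write to a temporary file so that a real books.json is never overwritten.
json_path <- tempfile(fileext = ".json")
writeLines(json_text, json_path)

# Reading the file back gives a list whose "books" element is a data frame.
round_trip <- fromJSON(json_path)$books
str(round_trip)
```

With this layout, `fromJSON()` simplifies the array of records into a data frame directly, so the `as.data.frame()` call mainly serves as a safeguard.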

R Code

# Load necessary libraries
library(jsonlite)
library(xml2)
## Warning: package 'xml2' was built under R version 4.2.3
library(rvest)
## Warning: package 'rvest' was built under R version 4.2.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:jsonlite':
## 
##     flatten
# Load JSON data
json_data <- fromJSON("books.json")
df_json <- as.data.frame(json_data$books)

# Load XML data
xml_data <- read_xml("books.xml")
xml_df <- xml_data %>% 
  xml_find_all("//book") %>% 
  map_df(~{
    tibble(
      title = xml_text(xml_find_first(.x, "title")),
      authors = xml_text(xml_find_first(.x, "authors")),
      published_year = xml_text(xml_find_first(.x, "published_year")),
      genre = xml_text(xml_find_first(.x, "genre"))
    )
  })

# Load HTML data
html_data <- read_html("books.html")
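# html_table() returns one table per <table> element; keep the first.
# The fill argument is deprecated since rvest 1.0.0 (missing cells are filled automatically).
df_html <- html_data %>% html_table() %>% .[[1]]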

# Check whether the data frames are identical.
# identical() is strict: it also compares object classes, column types, and attributes.
identical(df_json, xml_df) # Compare JSON and XML data frames
## [1] FALSE
identical(df_json, df_html) # Compare JSON and HTML data frames
## [1] FALSE
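
The report does not say why both comparisons return FALSE. One plausible cause, not confirmed by the source, is that the loaders produce different object classes (a base data frame versus a tibble) and different column types (for example, `xml_text()` always returns character, so `published_year` is text in the XML result). The sketch below uses small hypothetical stand-ins, `df_a` and `df_b`, and a hypothetical helper, `normalize_books()`, to show how such differences can be removed before comparing contents.

```r
library(dplyr)

# Hypothetical stand-ins for two loaded results (not the report's actual objects):
# a base data frame with an integer year, and a tibble with a character year.
df_a <- data.frame(
  title = c("The Pragmatic Programmer", "Clean Code"),
  published_year = c(1999L, 2008L),
  stringsAsFactors = FALSE
)
df_b <- tibble(
  title = c("The Pragmatic Programmer", "Clean Code"),
  published_year = c("1999", "2008")
)

# Strict comparison fails because the classes and column types differ.
strict_match <- identical(df_a, df_b)

# Hypothetical helper: coerce to a base data frame with all-character columns.
normalize_books <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  df[] <- lapply(df, as.character)
  df
}

# After normalization, only the cell contents and column names are compared.
content_match <- isTRUE(all.equal(normalize_books(df_a), normalize_books(df_b)))

strict_match
content_match
```

If the normalized comparison still fails on the real data, `all.equal()` without the `isTRUE()` wrapper returns a description of the remaining differences, such as mismatched column names.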