Assignment – Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the books’ information in HTML (using an HTML table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Load Packages

library(XML)
library(RCurl)
library(rvest)
## Warning: package 'rvest' was built under R version 4.1.3
library(xml2)
## Warning: package 'xml2' was built under R version 4.1.3
library(arsenal)
## Warning: package 'arsenal' was built under R version 4.1.3
library(jsonlite)
library(DT)
## Warning: package 'DT' was built under R version 4.1.3

Reading data from the HTML file

Link <- "https://raw.githubusercontent.com/Eperez54/Dat-607/main/Week%207/Books.html"

htmlContent <- read_html(Link)

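# Extract every <table> on the page; html_table() returns a list of tibbles
booksHTML <- htmlContent %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

# The page holds a single table, so keep the first list element
booksHTML <- booksHTML[[1]]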

datatable(booksHTML)
head(booksHTML)
## # A tibble: 4 x 6
##   `Book Title`                      Author `ISBN-13` Pages Binding `Amazon Link`
##   <chr>                             <chr>  <chr>     <int> <chr>   <chr>        
## 1 C# 10 and .NET 6 – Modern Cross-~ Mark ~ 978-1801~   824 Ebook   Amazon Link  
## 2 Head First C#: A Learner's Guide~ Jenni~ 978-1491~   800 Paperb~ Amazon Link  
## 3 R for Data Science: Import, Tidy~ Hadle~ 978-1491~   520 Paperb~ Amazon Link  
## 4 Intro to Python for Computer Sci~ Paul ~ 978-0135~   880 Paperb~ Amazon Link
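
The column names and the integer Pages column above come from how html_table() reads the header cells and converts cell text. A minimal sketch of that behavior, using a small inline table (hypothetical data standing in for Books.html) so that it runs without network access:

```r
library(rvest)

# Hypothetical two-row table standing in for Books.html
html_text <- '
<table>
  <tr><th>Book Title</th><th>Author</th><th>Pages</th></tr>
  <tr><td>Book A</td><td>Ann Author</td><td>100</td></tr>
  <tr><td>Book B</td><td>Bob Writer, Cy Coauthor</td><td>250</td></tr>
</table>'

page <- read_html(html_text)

# html_table() on a single node returns one tibble rather than a list
demo_html <- html_table(html_element(page, "table"))

names(demo_html)        # header cells become column names, spaces included
class(demo_html$Pages)  # numeric-looking cells are converted from text
```

Because the header text is used verbatim, a column such as "Book Title" keeps its space, which is why it will not match a differently named column from another source.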

Reading data from the JSON file

json_URL <- "https://raw.githubusercontent.com/Eperez54/Dat-607/main/Week%207/Books.json"
# fromJSON() returns a list; its first element holds the books as a data frame
booksJSON <- fromJSON(json_URL)
booksJSON <- booksJSON[[1]]
datatable(booksJSON)
head(booksJSON)
##                                                                                                                                                                                                                          title
## 1 C# 10 and .NET 6 – Modern Cross-Platform Development: Build apps, websites, and services with ASP.NET Core 6, Blazor, and EF Core 6 using Visual Studio 2022 and Visual Studio Code, 6th Edition 6th Edition, Kindle Edition
## 2                                                                                                                                 Head First C#: A Learner's Guide to Real-World Programming with C# and .NET Core 4th Edition
## 3                                                                                                                                           R for Data Science: Import, Tidy, Transform, Visualize, and Model Data 1st Edition
## 4                                                                                                       Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud 1st Edition
##                                 author        ISBN-13 pages   binding
## 1                        Mark J. Price 978-1801077361   824     Ebook
## 2                      Jennifer Greene 978-1491976708   800 Paperback
## 3    Hadley Wickham, Garrett Grolemund 978-1491910399   520 Paperback
## 4 Paul J. Deitel, Dr. Harvey M. Deitel 978-0135404676   880 Paperback
##                                                                                                                                                                                                                amazonURL
## 1                                                                     https://www.amazon.com/10-NET-Cross-Platform-Development-websites-ebook-dp-B09JV37DM6/dp/B09JV37DM6/ref=mt_other?_encoding=UTF8&me=&qid=1647752051
## 2                                                                                     https://www.amazon.com/Head-First-Learners-Real-World-Programming-dp-1491976705/dp/1491976705/ref=mt_other?_encoding=UTF8&me=&qid=
## 3                       https://www.amazon.com/Data-Science-Transform-Visualize-Model/dp/1491910399/ref=sr_1_1?crid=1ZKER249YHIJX&keywords=r+for+data+science&qid=1647752610&s=books&sprefix=R+%2Cstripbooks%2C45&sr=1-1
## 4 https://www.amazon.com/Intro-Python-Computer-Science-Data/dp/0135404673/ref=sr_1_9?crid=38FCK6SG4KJCM&keywords=python+for+data+science&qid=1647752945&s=books&sprefix=python+for+data+science%2Cstripbooks%2C42&sr=1-9
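
The [[1]] step is needed because the JSON array of books sits under a top-level key. A minimal sketch with an inline JSON string (the key name "books" and the two records are assumptions for illustration, not the contents of Books.json):

```r
library(jsonlite)

# Hypothetical JSON: an array of book objects under one top-level key
json_text <- '{
  "books": [
    {"title": "Book A", "author": "Ann Author", "pages": 100},
    {"title": "Book B", "author": "Bob Writer, Cy Coauthor", "pages": 250}
  ]
}'

parsed <- fromJSON(json_text)

# fromJSON() simplifies an array of objects into a data frame,
# stored here as the first (and only) element of the returned list
demo_json <- parsed[[1]]

str(demo_json)
```

Extracting by name, parsed$books in this sketch, would work equally well and does not depend on the element's position.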

Reading data from the XML file

xml_URL <- "https://raw.githubusercontent.com/Eperez54/Dat-607/main/Week%207/Books.xml"

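# Fetch the file with xml2::read_xml()
xmlContent <- read_xml(xml_URL)

# Re-parse with the XML package so that xmlToDataFrame() can be used
booksXML <- xmlParse(xmlContent)
booksXML <- xmlToDataFrame(booksXML)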
datatable(booksXML)
str(booksXML)
## 'data.frame':    4 obs. of  5 variables:
##  $ BookTitle: chr  "C# 10 and .NET 6 – Modern Cross-Platform Development: Build apps, websites, and services with ASP.NET Core 6, B"| __truncated__ "Head First C#: A Learner's Guide to Real-World Programming with C# and .NET Core 4th Edition" "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data 1st Edition" "Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud 1st Edition"
##  $ author   : chr  "Mark J. Price" "Jennifer Greene" "Hadley Wickham, Garrett Grolemund" "Paul J. Deitel, Dr. Harvey M. Deitel"
##  $ ISBN-13  : chr  "978-1801077361" "978-1491976708" "978-1491910399" "978-0135404676"
##  $ pages    : chr  "824" "800" "520" "880"
##  $ binding  : chr  "Ebook" "Paperback" "Paperback" "Paperback"
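
Note that every column in the str() output above is character, including pages. A minimal sketch of converting such columns after parsing, assuming a small inline XML document (hypothetical element names and data, not the contents of Books.xml):

```r
library(XML)

# Hypothetical XML: one <book> element per record
xml_text <- '<books>
  <book><title>Book A</title><author>Ann Author</author><pages>100</pages></book>
  <book><title>Book B</title><author>Bob Writer, Cy Coauthor</author><pages>250</pages></book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)

# Element text is returned as character for every column
demo_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
class_before <- class(demo_xml$pages)

# Let base R re-infer column types; as.is = TRUE keeps strings as character
demo_xml <- type.convert(demo_xml, as.is = TRUE)
class_after <- class(demo_xml$pages)

c(before = class_before, after = class_after)
```

type.convert() is one option among several; converting a single column with as.integer() would also work when only pages needs to change.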

Compare data frames using comparedf from the arsenal package

comparedf(booksXML, booksHTML)
## Compare Object
## 
## Function Call: 
## comparedf(x = booksXML, y = booksHTML)
## 
## Shared: 1 non-by variables and 4 observations.
## Not shared: 9 variables and 0 observations.
## 
## Differences found in 0/1 variables compared.
## 0 variables compared have non-identical attributes.
comparedf(booksXML, booksJSON)
## Compare Object
## 
## Function Call: 
## comparedf(x = booksXML, y = booksJSON)
## 
## Shared: 4 non-by variables and 4 observations.
## Not shared: 3 variables and 0 observations.
## 
## Differences found in 0/2 variables compared.
## 0 variables compared have non-identical attributes.
comparedf(booksHTML, booksJSON)
## Compare Object
## 
## Function Call: 
## comparedf(x = booksHTML, y = booksJSON)
## 
## Shared: 1 non-by variables and 4 observations.
## Not shared: 10 variables and 0 observations.
## 
## Differences found in 0/1 variables compared.
## 0 variables compared have non-identical attributes.
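
In the output above, comparedf matches columns by name, so most columns are reported as not shared rather than compared. A minimal sketch of harmonizing names and types before comparing, using two small hypothetical data frames that mimic the HTML and XML results (the target names "title" and "pages" are an arbitrary choice for illustration):

```r
library(arsenal)

# Hypothetical stand-ins for the HTML and XML data frames
from_html <- data.frame(
  `Book Title` = c("Book A", "Book B"),
  Pages = c(100L, 250L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
from_xml <- data.frame(
  BookTitle = c("Book A", "Book B"),
  pages = c("100", "250"),
  stringsAsFactors = FALSE
)

# Give both data frames the same column names
names(from_html) <- c("title", "pages")
names(from_xml) <- c("title", "pages")

# Align the column type: XML text becomes integer
from_xml$pages <- as.integer(from_xml$pages)

comparedf(from_xml, from_html)
same <- identical(from_xml, from_html)
same
```

Once names and types agree, every column is shared and the comparison reflects the values themselves rather than differences in how each format was parsed.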

Conclusion

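The three data frames are not identical. They describe the same four books, but they differ in structure. The column names differ by source (for example, Book Title in the HTML data frame, title in the JSON data frame, and BookTitle in the XML data frame). The XML data frame has five columns, while the HTML and JSON data frames have six. The pages column is an integer in the HTML data frame but character in the XML data frame. Because comparedf matches columns by name, it finds only one shared column (ISBN-13) between the HTML data frame and either of the other two, and four shared columns between the XML and JSON data frames.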