Working with XML and JSON in R

E-R Diagram

R Packages

# Load the packages needed for this assignment.
# plyr is attached before the tidyverse so that it does not mask dplyr functions.
library(plyr)
library(tidyverse)
library(readxl)
library(data.table)
library(readr)
library(dplyr)
library(stringr)
library(XML)
library(RCurl)
library(jsonlite)

Github Link:

Description:

This assignment is about learning how to pull less-structured data from the web. To get started, I will get familiar with R and the HTML, XML, and JSON formats. In R Markdown, we will create an R script that works with files in the HTML, XML, and JSON formats. We will pick three books of our choice and use the R script to access information about these books and display it. This information includes the title, the authors, and two or three other attributes of the three books, at least one of which has more than one author. This assignment also gives more practice with string manipulation functions and regular expressions. We should end up with three distinct files describing the same books, one in each of the three formats.

Approach

My approach is to read the following: Automated Data Collection with R, chapters 2 through 6; Data Science for Business, chapter 7; and Handling and Processing Strings in R by Gaston Sanchez, a freely downloadable eBook (113 pp.) with good material on string manipulation in base R and the stringr package, and on grep. If you are new to HTML, XML, JSON, and HTTP technologies, you may want to go through one of the many good online tutorials. For example, here are w3schools.com’s introductions to JSON and to XPath: https://www.w3schools.com/js/js_json_intro.asp https://www.w3schools.com/xml/xpath_intro.asp

Load the required R packages. The book files were uploaded to GitHub.
library(RCurl)
library(XML)
library(stringr)
library(plyr)

HTML Format

Hypertext Markup Language, or HTML, is a language for presenting content on the Web that was first proposed by Tim Berners-Lee (1989). Although HTML is not a dedicated data storage format, it frequently contains the information that we are interested in. We find data in texts, tables, lists, links, or other structures. Unfortunately, there is a difference between the way data are presented in a browser on the one side and how they are stored within the HTML code on the other.

It has been a very long time since I did any HTML coding.

# Examples from the book (kept for reference, not run):
# heritage_parsed <- htmlParse("http://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger",
#                              encoding = "UTF-8")
# tables <- readHTMLTable(heritage_parsed, stringsAsFactors = FALSE)
# readLines()
#
# url <- "http://wordnetweb.princeton.edu/perl/webwn"
# forms <- getHTMLFormDescription(url)

# Download the raw HTML file from GitHub as a single string
books.html <- getURL("https://raw.githubusercontent.com/asmozo24/DATA607_Assignment7/main/My_03Books.html")
# Parse every <table> in the page; the result is a list of data frames
books.html <- readHTMLTable(books.html, header = TRUE)

books.html

## $`NULL`
##                         Book_Title
## 1        Data Science for Business
## 2 Automated Data Collection with R
## 3       Learning statistics with R
##                                                        Book_Authors
## 1                                    Foster Provost and Tom Fawcett
## 2 Simon Munzert, Christian\nRubba, Peter MeiÃƒÂŸner, Dominic Nyhuis
## 3                                           Danielle Joseph Navarro
##   Book_Published_year  Book_version Book_Volume
## 1                2013 First Edition   384 pages
## 2                2015 First Edition   452 pages
## 3                2014 Fouth Edition   599 pages

view(books.html)
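
To look at the table itself rather than the one-element list that readHTMLTable() returns, the data frame can be pulled out of the list by position. The sketch below is a minimal illustration on a small inline HTML string, not the actual My_03Books.html file. The names html_text, doc, tables, and books_df are made up for this example, and passing encoding = "UTF-8" to htmlParse() assumes that the file really is saved as UTF-8.

```r
library(XML)

# Hypothetical stand-in for the HTML file: one table with a header row
html_text <- '<html><body><table>
<tr><th>Book_Title</th><th>Book_Authors</th></tr>
<tr><td>Data Science for Business</td><td>Foster Provost and Tom Fawcett</td></tr>
<tr><td>Learning statistics with R</td><td>Danielle Joseph Navarro</td></tr>
</table></body></html>'

# Parse the string as HTML and state the encoding explicitly (an assumption)
doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")

# readHTMLTable() returns a list with one entry per <table> element
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Take the first (and only) table out of the list
books_df <- tables[[1]]
books_df
```

Here books_df is a plain data frame whose column names come from the header row, so viewing it shows the table without the list wrapper.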

XML Format

The Extensible Markup Language, or XML, is one of the most popular formats for exchanging data over the Web. It is related to HTML in that both are markup languages. However, while HTML is used to shape the display of information, the main purpose of XML is to store data.

# URL of the XML file on GitHub
books.xml <- "https://raw.githubusercontent.com/asmozo24/DATA607_Assignment7/main/My_03Books.XML"

# Download the content of the XML file and assign it back to books.xml
books.xml <- getURLContent(books.xml)
# Convert the XML content to a data frame with xmlToDataFrame()
books.xml <- xmlToDataFrame(books.xml)
view(books.xml)
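
xmlToDataFrame() also accepts a document that has already been parsed, which makes it possible to run XPath queries on the same object. The following is a minimal sketch on an inline XML string. The element names books and book, and the variable names xml_text, doc, books_df, and titles, are assumptions made for this example; the real My_03Books.XML file may use different tags.

```r
library(XML)

# Hypothetical stand-in for the XML file: one <book> element per record
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book>
    <Book_Title>Data Science for Business</Book_Title>
    <Book_Authors>Foster Provost and Tom Fawcett</Book_Authors>
    <Book_Published_year>2013</Book_Published_year>
  </book>
  <book>
    <Book_Title>Learning statistics with R</Book_Title>
    <Book_Authors>Danielle Joseph Navarro</Book_Authors>
    <Book_Published_year>2014</Book_Published_year>
  </book>
</books>'

# Parse the string into an XML document
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes a row, each grandchild a column
books_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# XPath query: collect the text of every Book_Title element
titles <- xpathSApply(doc, "//book/Book_Title", xmlValue)

books_df
titles
```

Note that xmlToDataFrame() reads every value as text, so a column such as Book_Published_year would still need as.integer() before it can be used as a number.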

JSON Format

JSON (pronounced “Jason”) stands for JavaScript Object Notation. JSON was designed for the same tasks that XML is often used for: the storage and exchange of human-readable data. Many APIs of popular web applications provide data in the JSON format. JSON is a data format that has its origins in the JavaScript programming language. JSON allows a set of different data types for the value part of key/value pairs. From the perspective of an XML user, note what is not possible in JSON: we cannot add comments, we do not distinguish between missing values and null values, and there are no namespaces and no internal validation syntax like XML’s DTD. But this does not make JSON inferior to XML in absolute terms; the two are based on different concepts. JSON is not a markup language and not even a document format.

# Download the content of the JSON file and assign it to Books.Json
Books.Json <- getURL("https://raw.githubusercontent.com/asmozo24/DATA607_Assignment7/main/My_03Books.JSON")
# Parse the JSON text into an R object with fromJSON()
Books.Json <- fromJSON(Books.Json)
Books.Json

## $My_03Books.JSON
##                         Book_Title
## 1        Data Science for Business
## 2 Automated Data Collection with R
## 3       Learning statistics with R
##                                                              Book_Authors
## 1                                          Foster Provost and Tom Fawcett
## 2 Simon Munzert and Christian Rubba and Peter Meissner and Dominic Nyhuis
## 3                                                 Danielle Joseph Navarro
##   Book_Published_year  Book_version Book_Volume
## 1                2013 First Edition   384 pages
## 2                2015 First Edition   452 pages
## 3                2014 Fouth Edition   599 pages

view(Books.Json)
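
The points about value types and null values can be tried out on a small inline JSON string. This is only a sketch: the records are shortened, the authors are stored as JSON arrays rather than as one string (a variation that is not used in My_03Books.JSON), and the names json_text, parsed, and books_df are made up for the example. The top-level key mirrors the $My_03Books.JSON entry in the output above.

```r
library(jsonlite)

# Hypothetical stand-in for the JSON file: one top-level key holding an array of records
json_text <- '{"My_03Books.JSON": [
  {"Book_Title": "Data Science for Business",
   "Book_Authors": ["Foster Provost", "Tom Fawcett"],
   "Book_Published_year": 2013,
   "Book_version": "First Edition"},
  {"Book_Title": "Learning statistics with R",
   "Book_Authors": ["Danielle Joseph Navarro"],
   "Book_Published_year": 2014,
   "Book_version": null},
  {"Book_Title": "Automated Data Collection with R",
   "Book_Authors": ["Simon Munzert", "Christian Rubba"],
   "Book_Published_year": 2015}
]}'

# fromJSON() returns a list with one element per top-level key
parsed <- fromJSON(json_text)

# Pull the data frame out of the list by its key
books_df <- parsed[["My_03Books.JSON"]]

# Arrays of different lengths are kept as a list column
lengths(books_df$Book_Authors)

# A null value and a missing key both end up as NA
is.na(books_df$Book_version)
```

Extracting the element by name gives a bare data frame, and the last line shows that jsonlite treats an explicit null and an absent key the same way.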

Conclusion:

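The three files contain the same data. However, when I call view() on the three objects, I see a problem with the header for HTML and JSON, most likely because both results are lists that wrap the table (the $`NULL` and $My_03Books.JSON lines in the output) rather than bare data frames. Also, the HTML and JSON files had trouble with the name Peter Meißner, which shows up as Peter MeiÃƒÂŸner in the HTML output. This means the special character is not accepted or its encoding is wrong.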
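
One common way this kind of garbling arises is that text stored as UTF-8 is read as if it were Latin-1. Whether that is exactly what happened to these files has not been verified, so the following is only a sketch of the mechanism and of how a single round of it can be undone with base R's iconv(). All variable names are made up for the example.

```r
# The correct name, written with a Unicode escape so the script itself stays ASCII
original <- "Mei\u00dfner"

# The raw UTF-8 bytes of the name, with no encoding mark attached
utf8_bytes <- rawToChar(charToRaw(enc2utf8(original)))

# Simulate the problem: treat the UTF-8 bytes as Latin-1 and convert them to UTF-8 again
garbled <- iconv(utf8_bytes, from = "latin1", to = "UTF-8")

# Undo it: convert back to Latin-1 bytes, then declare those bytes to be UTF-8
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

nchar(original)  # 7 characters
nchar(garbled)   # 8 characters: the two bytes of the sharp s became two characters
identical(repaired, original)
```

If the file has been mis-read more than once, as the longer garbled string in the HTML output suggests, a single repair step like this would not be enough, and reading the file with the right encoding from the start is the safer fix.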