Objective of the Assignment:

The goal of this assignment is to work with less-structured data from the web. We focus on three standard formats for web data: HTML, XML, and JSON. Information about three books was stored in three files, one each in HTML, XML, and JSON, and the data is loaded from those files below.

Load the Required Libraries

library(XML) #for xml processing
library(rvest)  # to scrape (or harvest) data from html web pages
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
## 
##     xml
library(jsonlite) # for json files processing
library("kableExtra") # Construct Complex Table with 'kable' and Pipe Syntax

HTML File Manipulation

htmlfile <- "https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week7/book.html"

htmlTable <- read_html(htmlfile)
htmlBooks <- htmlTable %>%
  html_nodes("table") %>%  # select all <table> nodes from the HTML document
  .[[1]] %>%               # keep the first table
  html_table(fill = TRUE)  # convert the table to an R data frame, padding short rows with NA

colnames(htmlBooks) <- c("Category", "Title", "Authors", "Publisher", "Published Date", "ISBN")

htmlBooks  %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "200px")

| Category | Title | Authors | Publisher | Published Date | ISBN |
|---|---|---|---|---|---|
| Nutrition | The China Study | Thomas Campbell; T. Colin Campbell | BenBella Books | May 2006 | 1932100660 |
| Motivational | The 7 Habits of Highly Effective People | Stephen R. Covey | Free Press | November 2004 | 743269519 |
| Computers | The Linux Programming Interface | Michael Kerrisk | No Starch Press | October 2019 | 1593272200 |
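
The ISBN for the second book appears above without its leading zero, which is what happens when a table column is read as a number. The sketch below shows one possible way to keep such values as text. It assumes rvest 1.0.0 or later (for `minimal_html()`, `html_element()`, and the `convert` argument of `html_table()`), and the inline table is a hypothetical stand-in for book.html, not the actual file.

```r
library(rvest)

# Hypothetical inline table standing in for book.html (assumption: the
# source file stores the ISBN with its leading zero).
page <- minimal_html('
  <table>
    <tr><th>Title</th><th>ISBN</th></tr>
    <tr><td>The 7 Habits of Highly Effective People</td><td>0743269519</td></tr>
  </table>')

tbl <- html_element(page, "table")

converted <- html_table(tbl)                   # default: columns are type-converted
as_text   <- html_table(tbl, convert = FALSE)  # every cell stays a character string

class(converted$ISBN)  # a numeric type, so the leading zero cannot survive
as_text$ISBN           # the ISBN exactly as written in the table
```

Keeping identifiers such as ISBNs as character strings avoids this kind of silent change; numeric columns can still be converted explicitly afterwards.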

XML File Manipulation

xmlfile <- "https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week7/book.xml"
 
xmlData <- read_xml(xmlfile)
Books <- xml_children(xmlData)

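# Build a character matrix with one row per book and one column per field
xmlBooks <- c()
for (i in seq_along(Books)) {
  xmlBooks <- rbind(xmlBooks, xml_text(xml_children(Books[i])))
}

xmlBooks <- data.frame(xmlBooks, stringsAsFactors = FALSE)

# Take the column names from the child element names of the first book
colnames(xmlBooks) <- xml_name(xml_children(Books[1]))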

xmlBooks  %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "200px")

| title | category | authors | publisher | published_date | isbn-10 |
|---|---|---|---|---|---|
| The China Study | Nutrition | Thomas Campbell; T. Colin Campbell | BenBella Books | May 2006 | 1932100660 |
| The 7 Habits of Highly Effective People | Motivational | Stephen R. Covey | Free Press | November 2004 | 0743269519 |
| The Linux Programming Interface | Computers | Michael Kerrisk | Apress | October 2019 | 1593272200 |
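
The XML step above turns each book element into one row of text values. As a minimal sketch of the same idea without growing an object inside a loop, the example below builds one named character vector per book and binds them in a single call. The inline XML and the names `doc`, `books`, `rows`, and `df` are illustrative assumptions, not the contents of book.xml.

```r
library(xml2)

# Hypothetical two-book document standing in for book.xml
doc <- read_xml('<books>
  <book><title>The China Study</title><isbn-10>1932100660</isbn-10></book>
  <book><title>The 7 Habits of Highly Effective People</title><isbn-10>0743269519</isbn-10></book>
</books>')

books <- xml_children(doc)

# One named character vector per book: names are the child element names,
# values are the text of those elements.
rows <- lapply(seq_along(books), function(i) {
  fields <- xml_children(books[[i]])
  setNames(xml_text(fields), xml_name(fields))
})

# Bind the rows into a character matrix, then convert to a data frame.
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
df
```

This version assumes every book has the same child elements in the same order, which is the same assumption the loop above makes when it takes the column names from the first book.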

JSON File Manipulation

jsonfile <- "https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week7/book.json"

jsonBooks <- fromJSON(jsonfile) # parses the JSON; an array of records is simplified to an R data frame

jsonBooks  %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "100%", height = "200px")

| title | category | authors | publisher | published date | isbn-10 |
|---|---|---|---|---|---|
| The China Study | Nutrition | Thomas Campbell; T. Colin Campbell | BenBella Books | May 2006 | 1932100660 |
| The 7 Habits of Highly Effective People | Motivational | Stephen R. Covey | Free Press | November 2004 | 0743269519 |
| The Linux Programming Interface | Computers | Michael Kerrisk | No Starch Press | October 2019 | 1593272200 |
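
`fromJSON()` returns a data frame here because the file holds an array of records with matching keys, and jsonlite simplifies that shape by default. The short sketch below illustrates the difference between the simplified and unsimplified results. The inline JSON string is a hypothetical two-record stand-in for book.json.

```r
library(jsonlite)

# Hypothetical JSON array of records standing in for book.json
json <- '[
  {"title": "The China Study", "isbn-10": "1932100660"},
  {"title": "The 7 Habits of Highly Effective People", "isbn-10": "0743269519"}
]'

as_df   <- fromJSON(json)                          # array of records -> data frame
as_list <- fromJSON(json, simplifyVector = FALSE)  # same data as a nested list

class(as_df)     # a data frame with one row per record
class(as_list)   # a plain list with one element per record
as_df[["isbn-10"]]
```

Because the ISBNs are quoted strings in the JSON, they stay character values and keep any leading zero.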

Conclusion

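Although the three files (HTML, XML, and JSON) have different structures, we were able to load each one into a data frame holding the same three books. As the outputs above show, the results are close but not identical: the column names and column order differ, the HTML table shows the ISBN 0743269519 without its leading zero (most likely because that column was read as a number), and the XML output lists Apress rather than No Starch Press as the publisher of The Linux Programming Interface.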
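
One way to check how closely two of the resulting data frames agree is to harmonize their column names, types, and order before comparing them column by column. The sketch below uses base R only. The helper `harmonize()` and the one-row data frames `html_df` and `json_df` are hypothetical stand-ins for htmlBooks and jsonBooks, reduced to three columns for illustration.

```r
# Hypothetical one-row stand-ins for htmlBooks and jsonBooks
html_df <- data.frame(
  Category = "Motivational",
  Title    = "The 7 Habits of Highly Effective People",
  ISBN     = 743269519,            # read as a number, leading zero lost
  stringsAsFactors = FALSE
)
json_df <- data.frame(
  title    = "The 7 Habits of Highly Effective People",
  category = "Motivational",
  isbn     = "0743269519",         # kept as text
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case the names, make every column character,
# and put the columns in alphabetical order.
harmonize <- function(df) {
  names(df) <- tolower(names(df))
  df[] <- lapply(df, as.character)
  df[, sort(names(df)), drop = FALSE]
}

h <- harmonize(html_df)
j <- harmonize(json_df)

# Compare the two data frames one column at a time
same <- mapply(function(a, b) all(a == b), h, j)
same
```

In this toy case only the ISBN column disagrees, which mirrors the missing leading zero in the HTML output.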