Objective of the Assignment:
The goal of this assignment is to manipulate Less-structured data from the web. We’ll focus on the standard formats for web data: HTML, XML, and JSON. Three books were selected and information were stored in three files HTML, XML and JASON, from where data will be loaded.
Load the required Libraries
library(XML) #for xml processing
library(rvest) # to scrape (or harvest) data from html web pages
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
##
## xml
library(jsonlite) # for json files processing
library("kableExtra") # Construct Complex Table with 'kable' and Pipe Syntax
HTML file Manupulation
htmlfile <- "https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week7/book.html"
htmlTable <- read_html(htmlfile)
htmlBooks <- htmlTable %>%
html_nodes("table") %>% # Select nodes from THE HTML document
.[[1]] %>%
html_table(fill = NA) # converts data to an R data frame automatically
colnames(htmlBooks) <- c("Category","Title", "Authros", "Publisher","Published Date", "ISBN")
htmlBooks %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "200px")
|
Category
|
Title
|
Authros
|
Publisher
|
Published Date
|
ISBN
|
|
Nutrition
|
The China Study
|
Thomas Campbell; T. Colin Campbell
|
BenBella Books
|
May 2006
|
1932100660
|
|
Motivational
|
The 7 Habits of Highly Effective People
|
Stephen R. Covey
|
Free Press
|
November 2004
|
743269519
|
|
Computers
|
The Linux Programming Interface
|
Michael Kerrisk
|
No Starch Press
|
October 2019
|
1593272200
|
XML file Manupulation
xmlfile <- "https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week7/book.xml"
xmlData <- read_xml(xmlfile)
Books <- xml_children(xmlData)
xmlBooks <- c()
for (i in 1:length(Books)){
xmlBooks <- rbind(xmlBooks,xml_text(xml_children(Books[i])))
}
xmlBooks <- data.frame(xmlBooks)
colnames(xmlBooks) <- xml_name(xml_children(Books[1]))
xmlBooks %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "200px")
|
title
|
category
|
authors
|
publisher
|
published_date
|
isbn-10
|
|
The China Study
|
Nutrition
|
Thomas Campbell; T. Colin Campbell
|
BenBella Books
|
May 2006
|
1932100660
|
|
The 7 Habits of Highly Effective People
|
Motivational
|
Stephen R. Covey
|
Free Press
|
November 2004
|
0743269519
|
|
The Linux Programming Interface
|
Computers
|
Michael Kerrisk
|
Apress
|
October 2019
|
1593272200
|
JSON Manupulation
jsonfile <- "https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week7/book.json"
jsonBooks <- fromJSON(jsonfile) #converts any json object to an R data frame.
jsonBooks %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "100%", height = "200px")
|
title
|
category
|
authors
|
publisher
|
published date
|
isbn-10
|
|
The China Study
|
Nutrition
|
Thomas Campbell; T. Colin Campbell
|
BenBella Books
|
May 2006
|
1932100660
|
|
The 7 Habits of Highly Effective People
|
Motivational
|
Stephen R. Covey
|
Free Press
|
November 2004
|
0743269519
|
|
The Linux Programming Interface
|
Computers
|
Michael Kerrisk
|
No Starch Press
|
October 2019
|
1593272200
|
Conclusion
While all 3 files html, xml and json have different structures , We were able to get the same resulting dataframes As we can see.