In this assignment, we read the same table of books from three files in different formats: HTML, XML, and JSON. The files are hosted on GitHub, and their raw URLs are used to fetch the data.
library(rvest)
from_html <- read_html("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/Github_data607/assignment_5/Fav_books_assignment5.html")
# html_table() returns a list with one tibble per <table> on the page; the list is stored in bookTable.
# `fill = TRUE` pads rows that have fewer cells than the widest row with `NA`s.
bookTable <- from_html |>
  html_table(fill = TRUE)
bookTable
## [[1]]
## # A tibble: 3 × 4
##   Title                               Author                     Year …¹ Publi…²
##   <chr>                               <chr>                        <int> <chr>
## 1 Providence                          Max Berry                     2020 G. P. …
## 2 The Art of Star Wars: Galaxy’s Edge Amy Ratcliffe                 2021 Orbit
## 3 Data Science for Business           Foster Provost, Tom Fawce…    2013 O'Reil…
## # … with abbreviated variable names ¹`Year of Publish`, ²Publisher
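Because `html_table()` on a whole page returns a list, the `[[1]]` in the output above is the list index, not part of the table. A minimal sketch of pulling the single table out as a tibble follows. The inline page and the name `books_html` are illustrative assumptions that stand in for the hosted file, and only two of the four columns are reproduced.

```r
library(rvest)

# Hypothetical inline page standing in for the hosted HTML file.
page <- minimal_html('
  <table>
    <tr><th>Title</th><th>Author</th><th>Year of Publish</th></tr>
    <tr><td>Providence</td><td>Max Berry</td><td>2020</td></tr>
    <tr><td>Data Science for Business</td><td>Foster Provost, Tom Fawcett</td><td>2013</td></tr>
  </table>')

tables <- html_table(page)   # a list with one tibble per <table>
books_html <- tables[[1]]    # the first (and only) table as a tibble

books_html
```

With a single table on the page, `books_html` can then be used like any other data frame, for example `books_html$Title`.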
library(XML)
library(xml2)
library(plyr)
from_xml <- read_xml("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/Github_data607/assignment_5/Fav_books_assignment5.xml")
# Parse the document fetched from the URL into a tree that the XML package can work with.
xml_format <- xmlParse(file = from_xml)
# xmlToList() converts the parsed tree to a nested list, and ldply() row-binds each book into a data.frame.
xml_df <- ldply(xmlToList(xml_format), data.frame)
xml_df
##     .id                               title         author Year_of_Publish
## 1 books                          Providence      Max Berry            2020
## 2 books The Art of Star Wars: Galaxy’s Edge  Amy Ratcliffe            2021
## 3 books           Data Science for Business Foster Provost            2013
##             publisher    author.1
## 1 G. P. Putnam’s Sons        <NA>
## 2               Orbit        <NA>
## 3   O'Reily Media Inc Tom Fawcett
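The extra `author.1` column appears because the third book has two `author` elements. One possible alternative, sketched here with `xml2` only, is to collapse repeated authors into a single string per book. The root element name `catalog`, the inline XML, and the name `xml_books` are assumptions for illustration, and the records are trimmed to three fields; this is not the method used above.

```r
library(xml2)

# Hypothetical inline XML mirroring the structure implied by the output above:
# one <books> element per book, with one or more <author> children.
xml_src <- '<catalog>
  <books>
    <title>Providence</title>
    <author>Max Berry</author>
    <Year_of_Publish>2020</Year_of_Publish>
  </books>
  <books>
    <title>Data Science for Business</title>
    <author>Foster Provost</author>
    <author>Tom Fawcett</author>
    <Year_of_Publish>2013</Year_of_Publish>
  </books>
</catalog>'

doc <- read_xml(xml_src)
book_nodes <- xml_find_all(doc, ".//books")

# Build one row per book; repeated <author> tags are pasted into one cell.
xml_books <- do.call(rbind, lapply(book_nodes, function(b) {
  data.frame(
    title  = xml_text(xml_find_first(b, "./title")),
    author = paste(xml_text(xml_find_all(b, "./author")), collapse = ", "),
    year   = as.integer(xml_text(xml_find_first(b, "./Year_of_Publish"))),
    stringsAsFactors = FALSE
  )
}))

xml_books
```

This keeps one `author` column regardless of how many authors a book has, at the cost of losing the separation between individual authors.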
library(jsonlite)
# fromJSON() is a function from jsonlite that converts JSON data into R objects.
from_json <- fromJSON("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/Github_data607/assignment_5/Fav_books_assignment5.json")
from_json
## $books
##                                 title                      author
## 1                          Providence                   Max Berry
## 2 The Art of Star Wars: Galaxy’s Edge               Amy Ratcliffe
## 3           Data Science for Business Foster Provost, Tom Fawcett
##   Year of Publish           publisher
## 1            2020 G. P. Putnam’s Sons
## 2            2021               Orbit
## 3            2013   O'Reily Media Inc
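When authors are stored as JSON arrays of different lengths, `fromJSON()` keeps them as a list column rather than a character column. The sketch below shows how to check for that and flatten it. The inline JSON is an assumption about the file's layout (it is not shown in this report), and `books_json` is an illustrative name.

```r
library(jsonlite)

# Hypothetical inline JSON; assumes each book stores its authors in an array.
json_src <- '{"books": [
  {"title": "Providence", "author": ["Max Berry"], "Year of Publish": 2020},
  {"title": "Data Science for Business",
   "author": ["Foster Provost", "Tom Fawcett"], "Year of Publish": 2013}
]}'

books_json <- fromJSON(json_src)$books

# The author column is a list: one character vector per book.
author_is_list <- is.list(books_json$author)

# Collapse each vector into a single comma-separated string.
books_json$author <- vapply(books_json$author, paste, character(1), collapse = ", ")

books_json
```

After the `vapply()` step the column is a plain character vector, which makes it easier to compare with the HTML version of the table.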
Each file was imported differently, and the resulting data frames are not identical because the three formats structure the data differently. In the XML file, I used two author tags for the book with two authors, so R creates an additional column called author.1. In the JSON file, the authors are stored in a list, and in R the author column is a list as well, which prints both authors in a single cell. In the HTML table, both authors sit in one cell, so they are read as a single character string.
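Part of the mismatch is only in the column names (`Title` in HTML versus `title` in XML and JSON). As a rough sketch, names can be normalised before comparing two imports. The toy data frames and the helper `normalise_names()` are hypothetical and not part of the assignment code.

```r
# Toy one-row stand-ins for two of the imported tables (hypothetical values).
html_df <- data.frame(Title = "Providence", Author = "Max Berry",
                      `Year of Publish` = 2020L, check.names = FALSE)
json_df <- data.frame(title = "Providence", author = "Max Berry",
                      `Year of Publish` = 2020L, check.names = FALSE)

# Hypothetical helper: lower-case the names and replace runs of
# non-alphanumeric characters with an underscore.
normalise_names <- function(df) {
  names(df) <- gsub("[^a-z0-9]+", "_", tolower(names(df)))
  df
}

same_before <- identical(html_df, json_df)
same_after  <- identical(normalise_names(html_df), normalise_names(json_df))

c(before = same_before, after = same_after)
```

Differences in column types or in how multiple authors are stored would still need to be handled separately.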