data607_assignment5

In this assignment, we read the same table of books from three files in different formats: HTML, XML, and JSON. The files are uploaded to GitHub, and each one is fetched through its raw URL.

library(rvest)

from_html <- read_html("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/Github_data607/assignment_5/Fav_books_assignment5.html")

# html_table() returns a list with one tibble per <table> element in from_html, so bookTable is a list rather than a single data frame. The fill argument pads rows that have fewer cells than the maximum number of columns with `NA`s.
bookTable <- from_html |> 
  html_table(fill = TRUE)

bookTable

## [[1]]
## # A tibble: 3 × 4
##   Title                               Author                     Year …¹ Publi…²
##   <chr>                               <chr>                        <int> <chr>  
## 1 Providence                          Max Berry                     2020 G. P. …
## 2 The Art of Star Wars: Galaxy’s Edge Amy Ratcliffe                 2021 Orbit  
## 3 Data Science for Business           Foster Provost, Tom Fawce…    2013 O'Reil…
## # … with abbreviated variable names ¹`Year of Publish`, ²Publisher
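Because `html_table()` returns a list, the table itself still has to be pulled out of it before it can be used as a data frame. The following is a minimal sketch of that step. It assumes rvest 1.0 or later, and the inline HTML string is a hypothetical stand-in for the hosted file, using two of the rows printed above.

```r
library(rvest)

# Hypothetical inline page standing in for the hosted HTML file
page <- read_html("<table>
  <tr><th>Title</th><th>Author</th><th>Year of Publish</th><th>Publisher</th></tr>
  <tr><td>Providence</td><td>Max Berry</td><td>2020</td><td>G. P. Putnam's Sons</td></tr>
  <tr><td>Data Science for Business</td><td>Foster Provost, Tom Fawcett</td><td>2013</td><td>O'Reily Media Inc</td></tr>
</table>")

# html_table() on a whole page gives a list with one element per table
tables <- html_table(page)

# Take the first (and here only) table as a tibble
books_html <- tables[[1]]
books_html
```

An alternative with the same result is to select the node first with `html_element(page, "table")` and call `html_table()` on that single node.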

library(XML)
library(xml2)
library(plyr)

from_xml <- read_xml("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/Github_data607/assignment_5/Fav_books_assignment5.xml")

# Re-parse the document fetched from the URL with the XML package so that xmlToList() can be applied to it
xml_format <- xmlParse(file = from_xml)

# Convert the parsed document to a nested list with xmlToList(), then use ldply() to turn each book into a data.frame and row-bind the results.
xml_df <- ldply(xmlToList(xml_format), data.frame)
xml_df

##     .id                               title         author Year_of_Publish
## 1 books                          Providence      Max Berry            2020
## 2 books The Art of Star Wars: Galaxy’s Edge  Amy Ratcliffe            2021
## 3 books           Data Science for Business Foster Provost            2013
##             publisher    author.1
## 1 G. P. Putnam’s Sons        <NA>
## 2               Orbit        <NA>
## 3   O'Reily Media Inc Tom Fawcett
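The second `<author>` tag ends up in its own `author.1` column. One possible way to avoid that, sketched here with xml2 alone, is to collapse all authors of a book into a single string. This is not the approach used above; the root tag name `library` and the inline document are assumptions, and the child tag names are taken from the column names in the printed output.

```r
library(xml2)

# Hypothetical inline document; the root tag name is an assumption
doc <- read_xml("<library>
  <books>
    <title>Providence</title>
    <author>Max Berry</author>
    <Year_of_Publish>2020</Year_of_Publish>
    <publisher>G. P. Putnam's Sons</publisher>
  </books>
  <books>
    <title>Data Science for Business</title>
    <author>Foster Provost</author>
    <author>Tom Fawcett</author>
    <Year_of_Publish>2013</Year_of_Publish>
    <publisher>O'Reily Media Inc</publisher>
  </books>
</library>")

# One node per book
books <- xml_find_all(doc, ".//books")

xml_df2 <- data.frame(
  title = xml_text(xml_find_first(books, "./title")),
  # Join every <author> child of a book into one comma-separated string
  author = vapply(
    books,
    function(b) paste(xml_text(xml_find_all(b, "./author")), collapse = ", "),
    character(1)
  ),
  Year_of_Publish = as.integer(xml_text(xml_find_first(books, "./Year_of_Publish"))),
  publisher = xml_text(xml_find_first(books, "./publisher"))
)
xml_df2
```

With this layout the XML result has the same four columns as the HTML and JSON results, at the cost of losing the separate author entries.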

library(jsonlite)

# fromJSON() is a function from the jsonlite package that converts JSON objects into R objects.
from_json <- fromJSON("https://raw.githubusercontent.com/karmaggyatso/CUNY_SPS/main/Github_data607/assignment_5/Fav_books_assignment5.json")

from_json

## $books
##                                 title                      author
## 1                          Providence                   Max Berry
## 2 The Art of Star Wars: Galaxy’s Edge               Amy Ratcliffe
## 3           Data Science for Business Foster Provost, Tom Fawcett
##   Year of Publish           publisher
## 1            2020 G. P. Putnam’s Sons
## 2            2021               Orbit
## 3            2013   O'Reily Media Inc
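The printed output does not show how the `author` column is stored. The sketch below checks this on a small inline JSON string and then flattens the column to plain text. It assumes that every book stores its authors as a JSON array, which may differ from the hosted file; the raw string syntax `r"(...)"` needs R 4.0 or later.

```r
library(jsonlite)

# Hypothetical inline JSON; storing authors as arrays is an assumption
json_txt <- r"({"books": [
  {"title": "Providence", "author": ["Max Berry"],
   "Year of Publish": 2020, "publisher": "G. P. Putnam's Sons"},
  {"title": "Data Science for Business", "author": ["Foster Provost", "Tom Fawcett"],
   "Year of Publish": 2013, "publisher": "O'Reily Media Inc"}
]})"

# fromJSON() returns a named list; the data frame sits under $books
books_json <- fromJSON(json_txt)$books

# Arrays of different lengths are kept as a list column
author_is_list <- is.list(books_json$author)
author_is_list

# Collapse each author vector into one comma-separated string
books_json$author <- vapply(books_json$author, paste, character(1), collapse = ", ")
str(books_json)
```

After the last step `author` is an ordinary character column, which makes the JSON result easier to compare with the HTML table.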

Conclusion:

Every file was imported differently, and the resulting datasets are not identical because the three file formats are structured differently. In the XML file, I used two author tags to indicate that the book has two authors, and in R this produces an additional column called author.1. Similarly, in the JSON file the authors were stored in a list, and in R the author column is kept as a list as well, although the printed output does not make this visible.
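The differences in structure can also be checked directly on the column names. The following sketch copies the names from the three printed outputs above into hypothetical vectors (`html_cols`, `xml_cols`, `json_cols`) and compares them.

```r
# Column names as they appear in the printed outputs above
html_cols <- c("Title", "Author", "Year of Publish", "Publisher")
xml_cols  <- c(".id", "title", "author", "Year_of_Publish", "publisher", "author.1")
json_cols <- c("title", "author", "Year of Publish", "publisher")

# HTML and JSON differ only in capitalisation
same_exact      <- identical(html_cols, json_cols)
same_ignore_cap <- identical(tolower(html_cols), tolower(json_cols))

# Columns that exist only in the XML result
xml_only <- setdiff(xml_cols, json_cols)

same_exact
same_ignore_cap
xml_only
```

The same comparison could be run on `names()` of the actual data frames once the table has been extracted from each imported object.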


karmaGyatso

2022-10-16
