Introduction

The week seven assignment asks us to prepare three files that contain the same data, in order to practice loading different formats (HTML, JSON, and XML) into RStudio. After converting the files to data frames, we compare them to see whether there are any differences. To better understand the file structures, the three files were written by hand and then uploaded to GitHub for reference and reproducibility. Each file must contain the following information: Title, Authors, and two or three other attributes such as Genre, Rating, and Publisher.

Load the required libraries for the project.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RCurl)
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(rjson)
library(XML)
library(methods)
library(compare)
## 
## Attaching package: 'compare'
## 
## The following object is masked from 'package:base':
## 
##     isTRUE

Load the HTML file and display the data in the console.

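# Download the raw HTML from GitHub as a single string
html_file <- getURL("https://raw.githubusercontent.com/vitugo23/DATA607/main/Assignments/W7_Assignment/Book_table.html")
# readHTMLTable() returns a list with one element per table in the page
books.html <- readHTMLTable(html_file)
# Keep the first (and only) table
books.html <- books.html[[1]]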
books.html
##                    Book Name                         Author           Genre
## 1                 "The Hunt"                  Andrew Fukuda  Horror-Fiction
## 2 "And Then There Were None"                Agatha Christie   Mistery-Drama
## 3      "Beautiful Creatures" Kami Garcia and Margaret Stohl Fantasy-Fiction
##         Rating                     Publisher
## 1 3.7 out of 5 St. Martin's Publishing Group
## 2 4.4 out of 5                Harper Collins
## 3 3.8 out of 5    Little, Brown, and Company
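The HTML output above keeps literal quotation marks around each book title, which the XML and JSON versions do not have. A minimal sketch of one way to strip them, using only base R; the vector `titles` is a hand-typed copy of the printed values, not the downloaded table itself:

```r
# Hand-typed copy of the "Book Name" values shown above (assumption:
# the quotation marks are part of the cell text in the HTML file).
titles <- c('"The Hunt"', '"And Then There Were None"', '"Beautiful Creatures"')

# Remove every literal double quote; fixed = TRUE treats the pattern
# as plain text rather than a regular expression.
clean_titles <- gsub('"', "", titles, fixed = TRUE)

print(clean_titles)
```

Applied to the real data, the same `gsub()` call could be assigned back to the first column of `books.html`.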

Load the XML file and display the data in the console.

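# Download the raw XML from GitHub as a single string
xml_file <- getURL("https://raw.githubusercontent.com/vitugo23/DATA607/main/Assignments/W7_Assignment/Book_table.xml")
# Each record element becomes one row, each child element one column
books.xml <- xmlToDataFrame(xml_file)
books.xml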
##                       name                         author           genre
## 1                 The Hunt                  Andrew Fukuda  Horror-Fiction
## 2 And Then There Were None                Agatha Christie   Mistery-Drama
## 3      Beautiful Creatures Kami Garcia and Margaret Stohl Fantasy-Fiction
##         rating                     publisher
## 1 3.7 out of 5 St. Martin's Publishing Group
## 2 4.4 out of 5                Harper Collins
## 3 3.8 out of 5    Little, Brown, and Company

Load the JSON file and display the data in the console.

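# Download the raw JSON from GitHub as a single string
json_file <- getURL("https://raw.githubusercontent.com/vitugo23/DATA607/main/Assignments/W7_Assignment/Book_table.json")
# fromJSON() returns a list with one element per book
books.json <- fromJSON(json_file)
# Stack the per-book lists into a single table
books.json <- data.table::rbindlist(books.json)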
## Column 2 ['Author'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
books.json
##                    bookName                         author           genre
##                      <char>                         <char>          <char>
## 1:                 The Hunt                  Andrew Fukuda  Horror-Fiction
## 2: And Then There Were None                Agatha Christie   Mistery-Drama
## 3:      Beautiful Creatures Kami Garcia and Margaret Stohl Fantasy-Fiction
##          rating                     publisher
##          <char>                        <char>
## 1: 3.7 out of 5 St. Martin's Publishing Group
## 2: 4.4 out of 5                Harper Collins
## 3: 3.8 out of 5    Little, Brown, and Company
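The message printed by `rbindlist()` suggests that the records in the JSON file do not all use the same key names (for example `Author` in one record and `author` in another). The sketch below shows one possible way to normalize the keys before binding. It uses base R instead of `rbindlist()`, and the `records` list and the `normalise_record()` helper are hypothetical stand-ins for illustration, not the contents of the real file:

```r
# Hypothetical records shaped like a parsed JSON array of objects.
# The second record uses "Author" instead of "author" (assumed mismatch).
records <- list(
  list(bookName = "The Hunt", author = "Andrew Fukuda"),
  list(bookName = "And Then There Were None", Author = "Agatha Christie")
)

# Hypothetical helper: lower-case all keys and return a one-row data frame.
normalise_record <- function(rec) {
  rec <- as.list(rec)
  names(rec) <- tolower(names(rec))
  as.data.frame(rec, stringsAsFactors = FALSE)
}

# rbind() on data frames matches columns by name, so the rows line up
# once every record uses the same keys.
books <- do.call(rbind, lapply(records, normalise_record))

print(books)
```

Note that lower-casing every key also turns `bookName` into `bookname`; correcting the key in the JSON file itself would avoid the need for this step.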

Make a quick comparison between the three data frames, using the str() and compare() functions.

str(books.html)
## 'data.frame':    3 obs. of  5 variables:
##  $ Book Name: chr  "\"The Hunt\"" "\"And Then There Were None\"" "\"Beautiful Creatures\""
##  $ Author   : chr  "Andrew Fukuda" "Agatha Christie" "Kami Garcia and Margaret Stohl"
##  $ Genre    : chr  "Horror-Fiction" "Mistery-Drama" "Fantasy-Fiction"
##  $ Rating   : chr  "3.7 out of 5" "4.4 out of 5" "3.8 out of 5"
##  $ Publisher: chr  "St. Martin's Publishing Group" "Harper Collins" "Little, Brown, and Company"
str(books.xml)
## 'data.frame':    3 obs. of  5 variables:
##  $ name     : chr  "The Hunt" "And Then There Were None" "Beautiful Creatures"
##  $ author   : chr  "Andrew Fukuda" "Agatha Christie" "Kami Garcia and Margaret Stohl"
##  $ genre    : chr  "Horror-Fiction" "Mistery-Drama" "Fantasy-Fiction"
##  $ rating   : chr  "3.7 out of 5" "4.4 out of 5" "3.8 out of 5"
##  $ publisher: chr  "St. Martin's Publishing Group" "Harper Collins" "Little, Brown, and Company"
str(books.json)
## Classes 'data.table' and 'data.frame':   3 obs. of  5 variables:
##  $ bookName : chr  "The Hunt" "And Then There Were None" "Beautiful Creatures"
##  $ author   : chr  "Andrew Fukuda" "Agatha Christie" "Kami Garcia and Margaret Stohl"
##  $ genre    : chr  "Horror-Fiction" "Mistery-Drama" "Fantasy-Fiction"
##  $ rating   : chr  "3.7 out of 5" "4.4 out of 5" "3.8 out of 5"
##  $ publisher: chr  "St. Martin's Publishing Group" "Harper Collins" "Little, Brown, and Company"
##  - attr(*, ".internal.selfref")=<externalptr>
compare(books.html,books.xml,equal=TRUE)
## FALSE [FALSE, TRUE, TRUE, TRUE, TRUE]
compare(books.html,books.json,equal=TRUE)
## FALSE [FALSE, TRUE, TRUE, TRUE, TRUE]
compare(books.json,books.xml,equal = TRUE)
## FALSE [TRUE, TRUE, TRUE, TRUE, TRUE]
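In the last comparison every column matches, yet the overall result is still FALSE. One plausible reason, visible in the `str()` output, is that the first column is named `name` in one table and `bookName` in the other. The sketch below illustrates the effect with base R's `identical()` rather than `compare()`; the two small data frames and the `align_names()` helper are hypothetical stand-ins for the real tables:

```r
# Hypothetical two-row stand-ins for books.xml and books.json.
xml_like <- data.frame(
  name   = c("The Hunt", "Beautiful Creatures"),
  author = c("Andrew Fukuda", "Kami Garcia and Margaret Stohl"),
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  bookName = c("The Hunt", "Beautiful Creatures"),
  author   = c("Andrew Fukuda", "Kami Garcia and Margaret Stohl"),
  stringsAsFactors = FALSE
)

# Same values, different column names: not identical.
same_before <- identical(xml_like, json_like)

# Hypothetical helper: coerce to a plain data frame (this also converts
# a data.table) and apply a shared set of column names.
align_names <- function(df, col_names) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  names(df) <- col_names
  df
}

common <- c("title", "author")
same_after <- identical(align_names(xml_like, common),
                        align_names(json_like, common))

print(c(before = same_before, after = same_after))
```

The column names in `common` are arbitrary; any naming scheme works as long as all three tables share it.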

Conclusion

After a quick comparison of the three data frames, we can see that they have the same shape (3 observations of 5 variables) and that every column is of type character.

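There are differences, however. The JSON result differs in structure from the other two: rbindlist() returns a data.table rather than a plain data frame, and its first column is named bookName instead of name or Book Name. The HTML data frame also differs from the other two in its first column, where the book titles keep their literal quotation marks, so compare() reports FALSE for that column.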