knitr::opts_chunk$set(eval = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(rvest)
library(rjson)
library(XML)
library(xml2)
library(RCurl)
library(data.table)
library(knitr)
The HTML code was written in Notepad and saved with the
.html extension. The .html file was then uploaded to a
GitHub repo, from where it was loaded into RStudio. The following is the
HTML code for the table:
<table>
<tr>
<th>ID</th>
<th>Book</th>
<th>Author1</th>
<th>Author2</th>
<th>Year_Pub</th>
<th>Edition</th>
<th>Subject</th>
</tr>
<tr>
<th>1.0</th>
<td>Fundamentals of Thermodynamics</td>
<td>Michael J. Moran</td>
<td>Howard N. Shapiro</td>
<td>2020.0</td>
<td>Eight</td>
<td>Engineering</td>
</tr>
<tr>
<th>2.0</th>
<td>Relic</td>
<td>Lincoln Child</td>
<td>Douglas Preston</td>
<td>2019.0</td>
<td>First</td>
<td>Fiction</td>
</tr>
<tr>
<th>3.0</th>
<td>Heads you lose</td>
<td>Lisa Lutz</td>
<td>David Hayward</td>
<td>2017.0</td>
<td>First</td>
<td>Data Science</td>
</tr>
</table>
The .html table was loaded from the GitHub repo using the
read_html() function from the rvest library. After
reading the HTML, the table was converted to a data frame with the help of the
html_table() function from rvest and the
as.data.frame() function from base R:
url <- "https://raw.githubusercontent.com/Umerfarooq122/link-files/main/table.html"
table1 <- read_html(url)
book_html <- as.data.frame(table1 |> html_table())
kable(head(book_html))
| ID | Book | Author1 | Author2 | Year_Pub | Edition | Subject |
|---|---|---|---|---|---|---|
| 1 | Fundamentals of Thermodynamics | Michael J. Moran | Howard N. Shapiro | 2020 | Eight | Engineering |
| 2 | Relic | Lincoln Child | Douglas Preston | 2019 | First | Fiction |
| 3 | Heads you lose | Lisa Lutz | David Hayward | 2017 | First | Data Science |
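As a small offline illustration of the same steps, the sketch below parses an inline HTML string instead of the hosted file. The variable names (`html_src`, `doc`, `book_demo`) and the trimmed two-row table are assumptions made for this example only, and it selects the table explicitly with html_element() rather than converting the whole document as the chunk above does.

```r
library(rvest)

# Illustrative stand-in for table.html (not the author's full file)
html_src <- "<table>
  <tr><th>ID</th><th>Book</th><th>Year_Pub</th></tr>
  <tr><td>1.0</td><td>Relic</td><td>2019.0</td></tr>
  <tr><td>2.0</td><td>Heads you lose</td><td>2017.0</td></tr>
</table>"

# minimal_html() wraps the fragment in a complete HTML document
doc <- minimal_html(html_src)

# html_element() picks the first <table>; html_table() returns a tibble,
# which as.data.frame() turns into a plain data frame
book_demo <- as.data.frame(html_table(html_element(doc, "table")))

# Inspect the column types chosen by html_table()
str(book_demo)
```

Selecting the table element first avoids relying on the document containing exactly one table.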
The same process was followed for JSON as for HTML. First, the
JSON code was written in Notepad and saved with the .json
extension. Afterwards, the .json file was uploaded to the GitHub repo,
from where it was loaded into the RStudio environment. The following is the
JSON code for the table:
{
"ID":[1,2,3],
"Book":["Fundamentals of Thermodynamics","Relic","Heads you lose" ],
"Author1":["Michael J. Moran","Lincoln Child","Lisa Lutz"],
"Author2":["Howard N. Shapiro","Douglas Preston","David Hayward"],
"Year_Pub":[2020,2019,2017],
"Edition":["Eight","First","First"],
"Subject":["Engineering","Fiction","Data Science"]
}
The JSON table was loaded from the GitHub repo into RStudio. Similar to
HTML, the JSON table was read using fromJSON() from the
rjson library and stored as a data frame using the
as.data.frame() function from base R:
url <- "https://raw.githubusercontent.com/Umerfarooq122/link-files/main/table.json"
mydata <- fromJSON(file = url)
book_json <- as.data.frame(mydata)
kable(head(book_json))
| ID | Book | Author1 | Author2 | Year_Pub | Edition | Subject |
|---|---|---|---|---|---|---|
| 1 | Fundamentals of Thermodynamics | Michael J. Moran | Howard N. Shapiro | 2020 | Eight | Engineering |
| 2 | Relic | Lincoln Child | Douglas Preston | 2019 | First | Fiction |
| 3 | Heads you lose | Lisa Lutz | David Hayward | 2017 | First | Data Science |
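The same conversion can be tried without a network connection by parsing a JSON string directly. This is only a sketch: `json_src`, `parsed`, and `book_demo` are names invented for the example, and the data is a trimmed copy of the table above.

```r
library(rjson)

# Column-oriented JSON, as in table.json: one array per column
json_src <- '{
  "ID": [1, 2],
  "Book": ["Relic", "Heads you lose"],
  "Year_Pub": [2019, 2017]
}'

# fromJSON() returns a named list; each JSON array becomes an R vector
parsed <- rjson::fromJSON(json_str = json_src)

# Equal-length vectors in a named list map directly onto data frame columns
book_demo <- as.data.frame(parsed)

# Inspect the column types
str(book_demo)
```

The column-oriented layout is what lets as.data.frame() work in one step. A row-oriented JSON file (an array of objects) would need an extra reshaping step.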
The XML table was created using Notepad++, and
the code was then saved in the GitHub repo. The following is the code for the XML
table:
<?xml version="1.0" encoding="UTF-8"?>
<book>
<book1>
<ID>1</ID>
<Book>Fundamentals of Thermodynamics</Book>
<Author1>Michael J. Moran</Author1>
<Author2>Howard N. Shapiro</Author2>
<Year_Pub>2020</Year_Pub>
<Edition>Eight</Edition>
<Subject>Engineering</Subject>
</book1>
<book2>
<ID>2</ID>
<Book>Relic</Book>
<Author1>Lincoln Child</Author1>
<Author2>Douglas Preston</Author2>
<Year_Pub>2019</Year_Pub>
<Edition>First</Edition>
<Subject>Fiction</Subject>
</book2>
<book3>
<ID>3</ID>
<Book>Heads you lose</Book>
<Author1>Lisa Lutz</Author1>
<Author2>David Hayward</Author2>
<Year_Pub>2017</Year_Pub>
<Edition>First</Edition>
<Subject>Data Science</Subject>
</book3>
</book>
The XML table was loaded from the GitHub repo using the
getURL() function from the RCurl library.
Afterwards, with the help of the xmlToDataFrame() function from the
XML library, the XML table was converted to a data
frame:
a <- "https://raw.githubusercontent.com/Umerfarooq122/link-files/main/table.xml"
data <- getURL(a)
book_xml <- xmlToDataFrame(data)
kable(head(book_xml))
| ID | Book | Author1 | Author2 | Year_Pub | Edition | Subject |
|---|---|---|---|---|---|---|
| 1 | Fundamentals of Thermodynamics | Michael J. Moran | Howard N. Shapiro | 2020 | Eight | Engineering |
| 2 | Relic | Lincoln Child | Douglas Preston | 2019 | First | Fiction |
| 3 | Heads you lose | Lisa Lutz | David Hayward | 2017 | First | Data Science |
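A minimal sketch of the XML step on an inline string follows. The names `xml_src`, `doc`, and `book_demo` are assumptions for this example, and the records use a repeated `<book>` element rather than the `book1`, `book2`, `book3` names in the file above, which is a common convention but not a requirement.

```r
library(XML)

# Illustrative stand-in for table.xml with one child element per record
xml_src <- '<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book><ID>1</ID><Book>Relic</Book><Year_Pub>2019</Year_Pub></book>
  <book><ID>2</ID><Book>Heads you lose</Book><Year_Pub>2017</Year_Pub></book>
</books>'

# asText = TRUE tells xmlParse() the input is XML content, not a file path
doc <- xmlParse(xml_src, asText = TRUE)

# Each child of the root becomes a row; each grandchild becomes a column
book_demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# XML carries no type information, so inspect what the columns come back as
sapply(book_demo, class)
```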
In this section, the data frames from the different sources are compared to see whether they are the same or differ in any way:
With the help of the all.equal() function from
base R we will compare the data frames pairwise. The reason
I chose all.equal() over
identical() is that identical() only returns TRUE or FALSE, whereas
all.equal() reports where the data frames
differ from each other.
all.equal(book_xml, book_json)
## [1] "Component \"ID\": Modes: character, numeric"
## [2] "Component \"ID\": target is character, current is numeric"
## [3] "Component \"Year_Pub\": Modes: character, numeric"
## [4] "Component \"Year_Pub\": target is character, current is numeric"
We can see that the two data frames are equal in every respect except one: the data type of the ID and Year_Pub columns. In the XML table everything is loaded as character, whereas the JSON table loaded the numeric values as numeric. Let's compare XML to HTML:
all.equal(book_xml, book_html)
## [1] "Component \"ID\": Modes: character, numeric"
## [2] "Component \"ID\": target is character, current is numeric"
## [3] "Component \"Year_Pub\": Modes: character, numeric"
## [4] "Component \"Year_Pub\": target is character, current is numeric"
We see the same result: the HTML table was also loaded with its numeric values as type numeric. This suggests that comparing JSON and HTML should return TRUE. Let's compare and find out.
all.equal(book_html, book_json)
## [1] TRUE
As expected, the comparison returned TRUE for HTML and JSON.
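To illustrate why all.equal() is the more informative choice here, the sketch below compares two toy data frames that differ only in the type of one column. The data frames `a` and `b` are made up for this example.

```r
# Two toy data frames: same values, but ID is character in one, numeric in the other
a <- data.frame(ID = c("1", "2"), Book = c("Relic", "Heads you lose"))
b <- data.frame(ID = c(1, 2), Book = c("Relic", "Heads you lose"))

# identical() only answers yes or no
identical(a, b)

# all.equal() returns TRUE, or a character vector describing each difference
diffs <- all.equal(a, b)
print(diffs)

# In conditions, wrap the call in isTRUE(), because a failed comparison
# is a character vector rather than FALSE
if (!isTRUE(diffs)) {
  message("The data frames differ; see the report above.")
}
```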
The comparison showed a data type issue with XML: HTML and JSON loaded the numbers as numeric, but the numbers in XML were loaded as character. Some research suggested that this can be addressed by declaring the data types with a schema at the top of the XML file, but the idea here was to load the tables exactly as they were created in each file format and compare them.
Note: The attributes of the books in the table might not be accurate, since the point of the exercise was to work with different file types.
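If the goal were instead to bring the XML result in line with the other two after loading, one possible approach (not the one used in this report) is base R's type.convert(). The sketch below uses hand-built stand-ins, `xml_like` and `json_like`, rather than the actual book_xml and book_json objects.

```r
# Stand-in for an XML-derived data frame: every column is character
xml_like <- data.frame(
  ID = c("1", "2", "3"),
  Book = c("Fundamentals of Thermodynamics", "Relic", "Heads you lose"),
  Year_Pub = c("2020", "2019", "2017"),
  stringsAsFactors = FALSE
)

# Stand-in for a JSON-derived data frame: numbers are already numeric
json_like <- data.frame(
  ID = c(1, 2, 3),
  Book = c("Fundamentals of Thermodynamics", "Relic", "Heads you lose"),
  Year_Pub = c(2020, 2019, 2017),
  stringsAsFactors = FALSE
)

# type.convert() re-guesses each column's type; as.is = TRUE keeps
# text columns as character instead of turning them into factors
fixed <- type.convert(xml_like, as.is = TRUE)

sapply(fixed, class)

# Re-run the comparison on the converted data frame
all.equal(fixed, json_like)
```

The conversion happens after loading, so the XML file itself stays untouched, in keeping with the aim of comparing the formats as they were created.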