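# Packages for reading and tidying the three files
library(tidyverse)   # attaches dplyr, tibble and the pipe
library(rvest)       # read_html(), html_node(), html_table()
library(lemon)       # lemon_print() for data frame output
library(XML)         # xmlToDataFrame()
library(methods)
library(rjson)       # fromJSON(file = ...)
# Print data frames in the knitted report with lemon_print
knit_print.data.frame <- lemon_print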

Introduction to formats other than .csv

For this assignment, I will use R to read a collection of books from three different files: one XML, one JSON, and one HTML.

Every book follows the same table structure: Title, Author(s), Page length, Is it a best seller, Genre. This lets us compare what each format allows, how its syntax differs, and whether the tables’ values and types stay the same once they are loaded into a data frame.

Book 1 | XML

For our first book, we will load the book information with the XML package in R. It is a simple process: the xmlToDataFrame function loads the XML file into a data frame.

As for data types, xmlToDataFrame did not identify the page length column as an integer or the best-seller column as a Boolean; every column came back as character, and the values kept the newlines and indentation from the file. This is surprising, as XML is widely used for data transfer. It raises the question of whether the data types are lost during the conversion to a data frame. In addition, I tested an array in one column and it failed, as XML has no native array type.

book.xml<-xmlToDataFrame("Book_1.xml")
print(book.xml)
##                              TITLE                                    AUTHOR
## 1 \n            Starless\n         \n            Jacqueline Carey \n        
##                    PAGELENGTH                  ISBESTSELLER
## 1 \n            587\n         \n            False\n        
##                                            GENRE
## 1 \n        Fantasy, Historial Fiction\n
summary(book.xml)
TITLE             AUTHOR            PAGELENGTH        ISBESTSELLER      GENRE
Length:1          Length:1          Length:1          Length:1          Length:1
Class :character  Class :character  Class :character  Class :character  Class :character
Mode  :character  Mode  :character  Mode  :character  Mode  :character  Mode  :character
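
The character columns and stray whitespace shown above can be cleaned up after loading. The following is a minimal sketch in base R, not part of the original assignment: `book_xml_raw` is a hand-built stand-in for the data frame returned by xmlToDataFrame (values copied from the printed output), and `book_xml_clean` is a hypothetical name.

```r
# Stand-in for xmlToDataFrame("Book_1.xml"); values copied from the output above.
book_xml_raw <- data.frame(
  TITLE        = "\n            Starless\n        ",
  AUTHOR       = "\n            Jacqueline Carey \n        ",
  PAGELENGTH   = "\n            587\n        ",
  ISBESTSELLER = "\n            False\n        ",
  GENRE        = "\n        Fantasy, Historial Fiction\n",
  stringsAsFactors = FALSE
)

# Strip the leading/trailing newlines and spaces from every column.
book_xml_clean <- book_xml_raw
book_xml_clean[] <- lapply(book_xml_clean, trimws)

# Convert the two columns that should not be text.
book_xml_clean$PAGELENGTH   <- as.integer(book_xml_clean$PAGELENGTH)
book_xml_clean$ISBESTSELLER <- as.logical(book_xml_clean$ISBESTSELLER)

str(book_xml_clean)
```

The conversion is explicit per column here because the XML file itself carries only text; which columns are numeric or logical is knowledge the reader of the file has to supply.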

Book 2 | JSON

For our second book, I stored the information in a JSON file. This book has two authors and multiple genres, so let us see how R handles a data format that contains arrays.

In the transfer, JSON preserved the different data types, and their values carried over to R intact. However, R cannot turn the result directly into a one-row data frame: the list elements have different lengths (1, 2, and 3), so R treats the list as unbalanced. I had to pad the shorter columns by hand to get the final product.

book.json<-fromJSON(file="Book_2.json")
summary(book.json)
##               Length Class    Mode       
## Title         "1"    "-none-" "character"
## Author(s)     "2"    "-none-" "character"
## page.Length   "1"    "-none-" "numeric"  
## Is.BestSeller "1"    "-none-" "logical"  
## Genre(s)      "3"    "-none-" "character"
# Pad the shorter fields so every column has three rows
book.json <- data.frame(
  Title         = c(book.json$Title, "", ""),
  Author        = c(book.json$`Author(s)`[1], book.json$`Author(s)`[2], ""),
  page.length   = c(book.json$page.Length, NA, NA),
  is.bestseller = c(book.json$Is.BestSeller, NA, NA),
  Genre         = c(book.json$`Genre(s)`[1], book.json$`Genre(s)`[2], book.json$`Genre(s)`[3])
)

summary(book.json)
Title             Author            page.length    is.bestseller    Genre
Length:3          Length:3          Min.   :288    Mode :logical    Length:3
Class :character  Class :character  1st Qu.:288    FALSE:1          Class :character
Mode  :character  Mode  :character  Median :288    NA's :2          Mode  :character
                                    Mean   :288
                                    3rd Qu.:288
                                    Max.   :288
                                    NA's   :2
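
The manual padding used for this book could also be written generically. The sketch below is one possible approach, not the method used in the assignment: `book` is a hand-built stand-in for the list returned by fromJSON (values copied from the output in this document, with simplified field names), and `pad_to` is a hypothetical helper. Unlike the code above, it pads every column with NA rather than with empty strings.

```r
# Stand-in for fromJSON(file = "Book_2.json"); field names are simplified.
book <- list(
  Title         = "Good Omens",
  Author        = c("Terry Pratchett", "Niel Gaiman"),
  page.length   = 288,
  is.bestseller = FALSE,
  Genre         = c(" Horror", "Fantasy", "Comedy")
)

# The longest field decides how many rows the data frame needs.
n_rows <- max(lengths(book))

# Hypothetical helper: extending a vector's length fills the new slots with NA.
pad_to <- function(x, n) {
  length(x) <- n
  x
}

# Pad every field to the same length, then build the data frame.
book_df <- as.data.frame(lapply(book, pad_to, n = n_rows),
                         stringsAsFactors = FALSE)
print(book_df)
```

Padding with NA keeps each column's type intact (the title column stays character, the page length stays numeric), at the cost of NA appearing where the original code used empty strings.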

Book 3 | HTML

For our last book, we will extract the information from an HTML table. In our previous test with XML, the data transfer did not recognize the data types in the table. HTML, like XML, is a markup language, so I expected the chance of it identifying data types to be low. While creating the HTML file, I found that arrays are not supported in this language. A table follows a sequential layout, in which the cells of each row line up in order with the column headers. For the data extraction, we used functions from tidyverse and rvest.

Surprisingly, the data types of the HTML table were recognized. The integer and Boolean columns were identified correctly where XML failed. Strictly speaking, the HTML file itself stores only text; the types are inferred by html_table(), which converts columns by default. Like XML, HTML does not support array cells, so it is limited when a field has multiple descriptors.

book.html<-"Book_3.html"
book.html<-book.html%>%read_html()%>%html_node("table") %>% html_table()

book.html<-as.data.frame(book.html)
summary(book.html)
Title             Author            page.length    is.BestSeller    Genres
Length:1          Length:1          Min.   :320    Mode :logical    Length:1
Class :character  Class :character  1st Qu.:320    FALSE:1          Class :character
Mode  :character  Mode  :character  Median :320                     Mode  :character
                                    Mean   :320
                                    3rd Qu.:320
                                    Max.   :320
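
The typed columns in this result come from type inference applied to the cell text. As a rough illustration in base R, the sketch below runs type.convert() on stand-in cell strings; rvest documents that html_table() does this when its convert argument is TRUE, which is the default. The names `cells` and `typed` are hypothetical, and the exact cell text "FALSE" is an assumption, since the HTML file is not shown here.

```r
# Stand-in for the raw text of one table row (assumed cell contents).
cells <- data.frame(
  Title         = "All Boys Aren't Blue",
  page.length   = "320",
  is.BestSeller = "FALSE",
  stringsAsFactors = FALSE
)

# Let R guess a type for each column from its text.
typed <- cells
typed[] <- lapply(cells, type.convert, as.is = TRUE)

# Show the inferred class of every column.
print(sapply(typed, class))
```

The same call could be applied to a trimmed XML data frame, which suggests the difference between the two imports lies in the reader function rather than in the markup.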

Conclusion

print(as_tibble(book.xml))
## # A tibble: 1 x 5
##   TITLE                              AUTHOR        PAGELENGTH ISBESTSELLER GENRE
##   <chr>                              <chr>         <chr>      <chr>        <chr>
## 1 "\n            Starless\n        " "\n         ~ "\n      ~ "\n        ~ "\n ~
print(as_tibble(book.json))
## # A tibble: 3 x 5
##   Title        Author            page.length is.bestseller Genre    
##   <chr>        <chr>                   <dbl> <lgl>         <chr>    
## 1 "Good Omens" "Terry Pratchett"         288 FALSE         " Horror"
## 2 ""           "Niel Gaiman"              NA NA            "Fantasy"
## 3 ""           ""                         NA NA            "Comedy"
print(as_tibble(book.html))
## # A tibble: 1 x 5
##   Title                Author                 page.length is.BestSeller Genres  
##   <chr>                <chr>                        <int> <lgl>         <chr>   
## 1 All Boys Aren't Blue George Matthew Johnson         320 FALSE         Young A~

All three formats have limits on the amount and variety of information they can hold. XML and HTML cells are limited to single values, or to strings that need a separator before the individual items can be extracted. JSON supports multiple values, but it is less convenient to tabulate: every column of a data frame must have the same length, so further manipulation is needed to build the table.
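
To illustrate the separator approach for XML and HTML, here is a small base R sketch that splits a comma-separated genre string into individual values. The string is the genre value from Book 1 after trimming; `genre_field` and `genres` are hypothetical names.

```r
# Genre cell from Book 1, stored as a single comma-separated string.
genre_field <- "Fantasy, Historial Fiction"

# Split on the comma, then trim the space left after the separator.
genres <- trimws(strsplit(genre_field, ",", fixed = TRUE)[[1]])

print(genres)
print(length(genres))
```

This only works when the separator never appears inside a value, which is the main weakness of storing several items in a single cell.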

The HTML and JSON imports both arrived in R with the data types of their columns intact. XML, whose purpose is data storage, did not identify any of the data types in the document; it transferred every column as a string. In conclusion, these data frames are not identical.