Homework 7

Introduction to formats other than .csv

For this assignment, I read a collection of books from three different files in R: one XML, one JSON, and one HTML.

Every book uses the same table structure: title, author(s), page length, whether it is a best seller, and genre. This lets us compare what each format allows, how its syntax differs, and whether the table's values stay the same once it becomes a data frame.

Book 1 | XML

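For our first book, we will load the book information using the XML package in R. It is a simple process: the xmlToDataFrame function loads the XML file into a data frame.

As for types, the import did not identify the page length column as an integer or the best-seller column as a Boolean. Every column came back as a character string, padded with the whitespace from the file. This is surprising, as XML is widely used for data transfer. It raises the question of whether the data types were lost in the conversion to a data frame; since plain XML stores element content as text unless a schema says otherwise, the types were most likely never there to lose. In addition, I tested putting an array in one column and it failed: XML has no built-in array type, and xmlToDataFrame appears to expect one flat value per field.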

library(XML)  # provides xmlToDataFrame()
book.xml<-xmlToDataFrame("Book_1.xml")
print(book.xml)

##                              TITLE                                    AUTHOR
## 1 \n            Starless\n         \n            Jacqueline Carey \n        
##                    PAGELENGTH                  ISBESTSELLER
## 1 \n            587\n         \n            False\n        
##                                            GENRE
## 1 \n        Fantasy, Historial Fiction\n

summary(book.xml)

TITLE	AUTHOR	PAGELENGTH	ISBESTSELLER	GENRE
Length:1	Length:1	Length:1	Length:1	Length:1
Class :character	Class :character	Class :character	Class :character	Class :character
Mode :character	Mode :character	Mode :character	Mode :character	Mode :character
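
The whitespace and the all-character columns in the XML result can be repaired after loading. The following is a minimal sketch in base R, assuming a data frame shaped like the output above. The names `book.xml.raw` and `book.xml.clean` are illustrative, and the values are retyped by hand from the printed output rather than read from Book_1.xml.

```r
# Stand-in for the result of xmlToDataFrame("Book_1.xml"); the values are
# copied from the printed output above, including the surrounding whitespace.
book.xml.raw <- data.frame(
  TITLE        = "\n            Starless\n        ",
  AUTHOR       = "\n            Jacqueline Carey \n        ",
  PAGELENGTH   = "\n            587\n        ",
  ISBESTSELLER = "\n            False\n        ",
  GENRE        = "\n        Fantasy, Historial Fiction\n",
  stringsAsFactors = FALSE
)

# Trim leading and trailing whitespace from every column.
book.xml.clean <- book.xml.raw
book.xml.clean[] <- lapply(book.xml.clean, trimws)

# Let base R guess each column's type from its text.
book.xml.clean[] <- lapply(book.xml.clean, type.convert, as.is = TRUE)

str(book.xml.clean)
```

With these stand-in values, the page length should come back as an integer and the best-seller flag as a logical, which suggests the type information can be recovered from the text even though the XML import did not supply it.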

Book 2 | JSON

For our second book, I put the information into a JSON file. This book has two authors and multiple genres, so let us see how R handles a format that holds an array of values.

In the transfer, JSON kept its data types: the page length arrived in R as a number and the best-seller flag as a logical. However, the parsed result is a list, and R cannot turn it into a data frame directly when a field holds an array. The list elements have unequal lengths (one title, two authors, three genres), so I had to pad the shorter ones by hand to get the final product.

library(rjson)  # assumed package: fromJSON(file = ...) matches the rjson interface
book.json<-fromJSON(file="Book_2.json")
summary(book.json)

##               Length Class    Mode       
## Title         "1"    "-none-" "character"
## Author(s)     "2"    "-none-" "character"
## page.Length   "1"    "-none-" "numeric"  
## Is.BestSeller "1"    "-none-" "logical"  
## Genre(s)      "3"    "-none-" "character"

# Pad the shorter fields with "" or NA so all five columns have three rows
book.json<-data.frame(
  Title=c(book.json$Title,"",""),
  Author=c(book.json$`Author(s)`[1],book.json$`Author(s)`[2],""),
  page.length=c(book.json$page.Length,NA,NA),
  is.bestseller=c(book.json$Is.BestSeller,NA,NA),
  Genre=c(book.json$`Genre(s)`[1],book.json$`Genre(s)`[2],book.json$`Genre(s)`[3])
)

summary(book.json)

Title	Author	page.length	is.bestseller	Genre
Length:3	Length:3	Min. :288	Mode :logical	Length:3
Class :character	Class :character	1st Qu.:288	FALSE:1	Class :character
Mode :character	Mode :character	Median :288	`NA`s :2	Mode :character
		Mean :288
		3rd Qu.:288
		Max. :288
		`NA`s :2
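
The padded data frame above spreads one book over three rows. As an alternative, here is a minimal sketch that keeps the book on a single row by storing the arrays in list-columns. It uses base R only; `book.list` and `book.wide` are illustrative names, and the values are copied by hand from the tibble printed in the Conclusion rather than read from Book_2.json.

```r
# Stand-in for the list returned by fromJSON(); the values are copied from
# the printed output in this report, not read from Book_2.json.
book.list <- list(
  Title         = "Good Omens",
  `Author(s)`   = c("Terry Pratchett", "Niel Gaiman"),
  page.Length   = 288,
  Is.BestSeller = FALSE,
  `Genre(s)`    = c(" Horror", "Fantasy", "Comedy")
)

# I(list(...)) stores a whole vector inside a single cell (a list-column),
# so the book stays on one row and no padding with "" or NA is needed.
book.wide <- data.frame(
  Title         = book.list$Title,
  Author        = I(list(book.list$`Author(s)`)),
  page.length   = book.list$page.Length,
  is.bestseller = book.list$Is.BestSeller,
  Genre         = I(list(trimws(book.list$`Genre(s)`))),
  stringsAsFactors = FALSE
)

nrow(book.wide)        # number of rows used for the book
book.wide$Author[[1]]  # the authors, still a character vector
```

The trade-off is that list-columns are less convenient for functions such as summary(), which is one reason the padded version may still be the easier one to inspect.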

Book 3 | HTML

For our last book, we will use an HTML table to hold the information. In our previous test with XML, the import did not recognize the data types in the table. HTML is a markup language for presentation and, like XML, stores cell contents as text, so I expected the chance of it identifying data types to be low. While creating the HTML, I found that arrays are not supported in this language. A table follows a sequential format in which the order of the cells matches the order of the column headers. For the data extraction, we used the tidyverse and rvest packages.

Surprisingly, the HTML import recognized the data types of the table. It correctly identified the integer and Boolean columns where the XML import failed. This is most likely the work of rvest rather than of HTML itself: html_table converts cell text to a suitable type by default. Like XML, HTML does not support array cells, so it is limited when a field has multiple descriptors.

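library(tidyverse)  # provides %>% and as_tibble()
library(rvest)      # provides read_html(), html_node(), html_table()
book.html<-"Book_3.html"
book.html<-book.html%>%read_html()%>%html_node("table") %>% html_table()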

book.html<-as.data.frame(book.html)
summary(book.html)

Title	Author	page.length	is.BestSeller	Genres
Length:1	Length:1	Min. :320	Mode :logical	Length:1
Class :character	Class :character	1st Qu.:320	FALSE:1	Class :character
Mode :character	Mode :character	Median :320		Mode :character
		Mean :320
		3rd Qu.:320
		Max. :320

Conclusion

print(as_tibble(book.xml))

## # A tibble: 1 x 5
##   TITLE                              AUTHOR        PAGELENGTH ISBESTSELLER GENRE
##   <chr>                              <chr>         <chr>      <chr>        <chr>
## 1 "\n            Starless\n        " "\n         ~ "\n      ~ "\n        ~ "\n ~

print(as_tibble(book.json))

## # A tibble: 3 x 5
##   Title        Author            page.length is.bestseller Genre    
##   <chr>        <chr>                   <dbl> <lgl>         <chr>    
## 1 "Good Omens" "Terry Pratchett"         288 FALSE         " Horror"
## 2 ""           "Niel Gaiman"              NA NA            "Fantasy"
## 3 ""           ""                         NA NA            "Comedy"

print(as_tibble(book.html))

## # A tibble: 1 x 5
##   Title                Author                 page.length is.BestSeller Genres  
##   <chr>                <chr>                        <int> <lgl>         <chr>   
## 1 All Boys Aren't Blue George Matthew Johnson         320 FALSE         Young A~

All three formats have limits on the amount and variety of information they can carry into a data frame. The XML and HTML tables are limited to single values, or to strings that need a separator to split them after extraction. JSON can hold multiple values in an array, but a data frame needs every column to be the same length, so the list has to be manipulated further before it becomes a table.
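
To illustrate the separator approach, here is a minimal sketch in base R that splits a comma-separated genre string into individual values. The string is copied from the XML output earlier in this report; `genre.string` and `genres` are illustrative names.

```r
# Genre string as it appears in the XML output (after trimming whitespace).
genre.string <- "Fantasy, Historial Fiction"

# Split on the comma, then trim the space left after each separator.
genres <- trimws(strsplit(genre.string, ",", fixed = TRUE)[[1]])

length(genres)
genres
```

This only works if the separator never appears inside a value, which is an assumption the file's author has to enforce by hand.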

The HTML and JSON imports both identified and kept the data types of the columns in the transfer to R. The XML import, despite XML's role in data storage and transfer, did not identify any of the data types in the document; it transferred every column as a string. In conclusion, these data frames are not identical.