##Assignment - Working with XML and JSON in R
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
##Data Source
For this assignment I created three data files in TextPad, saving each file with a different extension (.json, .xml, or .html). I then uploaded the files to my GitHub repository.
#These packages are needed for this assignment; once they were installed, I commented out the install lines.
#install.packages('rvest') - for loading html files
#install.packages('XML') - for loading xml files
#install.packages('jsonlite') - for loading json files
library(rvest)
library(XML)
library(jsonlite)
library(httr)
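The `install.packages()` lines above are run once and then commented out by hand. As an alternative, a small helper can report which packages are still missing. This is only a sketch: `missing_packages` is a hypothetical name introduced here for illustration and is not part of the original workflow.

```r
# Hypothetical helper (not in the original write-up): return the names of
# packages that are not installed, without attaching or installing anything.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("rvest", "XML", "jsonlite", "httr", "xml2")
to_install <- missing_packages(needed)

# Installation is left commented out so the script has no side effects:
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

Because the helper only reports, the decision to install stays explicit and the script can be re-run safely.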
##Loading my “.json” file
# Reading in .json file from GitHub
gitJson <- "https://raw.githubusercontent.com/carolc57/Data607-Fall23/main/booklist.json"
booklistjson <- fromJSON(gitJson, flatten = TRUE)
head(booklistjson)
## title author_1 author_2 author_3
## 1 Things I Wish I Told My Mother Susan Patterson Susan Dilallo James Patterson
## 2 The Relic Douglas Preston Lincoln Child <NA>
## 3 The Glass Ocean Beatriz Williams Lauren Willis Karen White
## genre num_pgs type ratings
## 1 Fiction, Romance, Contemporary 320 hardcover 3.99
## 2 Horror, Thriller, Mystery 480 paperback 4.05
## 3 Historical Fiction, Romance, Mystery 624 paperback 3.88
#ratings column came in as chr type; need to change to numeric
booklistjson <- transform(booklistjson, ratings = as.numeric(ratings))
class(booklistjson$ratings)
## [1] "numeric"
booklistjson
## title author_1 author_2 author_3
## 1 Things I Wish I Told My Mother Susan Patterson Susan Dilallo James Patterson
## 2 The Relic Douglas Preston Lincoln Child <NA>
## 3 The Glass Ocean Beatriz Williams Lauren Willis Karen White
## genre num_pgs type ratings
## 1 Fiction, Romance, Contemporary 320 hardcover 3.99
## 2 Horror, Thriller, Mystery 480 paperback 4.05
## 3 Historical Fiction, Romance, Mystery 624 paperback 3.88
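The JSON file itself is not shown, so the record layout below is an assumption inferred from the column names in the output (`author_1` to `author_3`, `num_pgs`, `ratings`). As a minimal sketch, parsing a small inline string with `fromJSON()` reproduces two behaviors seen above: a record without an `author_3` key yields `NA`, and ratings stored as quoted strings arrive as character.

```r
library(jsonlite)

# Assumed layout: one JSON object per book, with ratings stored as strings.
json_text <- '[
  {"title": "Things I Wish I Told My Mother", "author_1": "Susan Patterson",
   "author_2": "Susan Dilallo", "author_3": "James Patterson",
   "num_pgs": 320, "ratings": "3.99"},
  {"title": "The Relic", "author_1": "Douglas Preston",
   "author_2": "Lincoln Child", "num_pgs": 480, "ratings": "4.05"}
]'

books <- fromJSON(json_text, flatten = TRUE)

# The second record has no author_3 key, so that cell is filled with NA.
print(is.na(books$author_3))

# Quoted numbers are read as character; convert them explicitly.
ratings_class_before <- class(books$ratings)
print(ratings_class_before)
books$ratings <- as.numeric(books$ratings)
str(books)
```

Storing the ratings as unquoted numbers in the JSON file would likely make the conversion step unnecessary.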
##Loading my “.html” file
# Reading from HTML
gitHtml <- "https://github.com/carolc57/Data607-Fall23/blob/main/booklist.html"
booklisthtml <- gitHtml %>%
read_html() %>%
html_table(fill = TRUE) %>%
.[[1]]
head(booklisthtml)
## # A tibble: 3 x 6
## title author genre num_pages type ratings
## <chr> <chr> <chr> <int> <chr> <chr>
## 1 Things I Wish I Told My Mother Susan Patterson,~ Fict~ 320 hard~ 3.99
## 2 the Relic Douglas Preston,~ Horr~ 480 soft~ 4.05<
## 3 The Glass Ocean Beatriz Williams~ Hist~ 624 soft~ 3.88
#convert to dataframe
booklisthtml_df <- as.data.frame(booklisthtml)
#ratings column came in as chr type because one value carries a stray "<" (4.05<);
#strip non-numeric characters before converting so that no NA is introduced
booklisthtml_df <- transform(booklisthtml_df, ratings = as.numeric(gsub("[^0-9.]", "", ratings)))
class(booklisthtml_df$ratings)
## [1] "numeric"
booklisthtml_df
## title
## 1 Things I Wish I Told My Mother
## 2 the Relic
## 3 The Glass Ocean
## author
## 1 Susan Patterson, Susan Dilallo, James Patterson
## 2 Douglas Preston, Lincoln Child
## 3 Beatriz Williams, Lauren Willis, Karen White
## genre num_pages type ratings
## 1 Fiction, Romance, Contemporary 320 hardcover 3.99
## 2 Horror, Thriller, Mystery 480 softcover 4.05
## 3 Historical Fiction, Romance, Mystery 624 softcover 3.88
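One plausible reason the `ratings` column arrived as `<chr>` is the stray `<` in `4.05<`: `html_table()` guesses column types, and a single non-numeric cell keeps the whole column as text. The sketch below reproduces that with a small inline table (the markup is made up for illustration and is not the contents of `booklist.html`) and cleans the column before converting.

```r
library(rvest)

# Made-up table for illustration; "&lt;" becomes a literal "<" in the cell text.
html_text <- "<table>
  <tr><th>title</th><th>num_pages</th><th>ratings</th></tr>
  <tr><td>The Relic</td><td>480</td><td>4.05&lt;</td></tr>
  <tr><td>The Glass Ocean</td><td>624</td><td>3.88</td></tr>
</table>"

page <- read_html(html_text)
books <- html_table(html_element(page, "table"))

# num_pages is converted to integer, but ratings stays character
# because one cell contains a non-numeric character.
ratings_class_before <- class(books$ratings)
print(sapply(books, class))

# Drop everything except digits and the decimal point, then convert.
books$ratings <- as.numeric(gsub("[^0-9.]", "", books$ratings))
print(books)
```

Fixing the stray character in the HTML source would be the cleaner remedy; the `gsub()` step is a workaround for when the file cannot be edited.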
##Loading my “.xml” file
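#read the file over https with xml2, then hand it to XML::xmlParse so that xmlToDataFrame can be used
library(xml2)
booklistxml <- xmlParse(read_xml('https://raw.githubusercontent.com/carolc57/Data607-Fall23/main/booklist.xml'))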
booklistxml
## <?xml version="1.0" encoding="UTF-8"?>
## <Books>
## <Book>
## <title>Things I Wish I Told My Mother</title>
## <author>Susan Patterson, Susan Dilallo, James Patterson</author>
## <genre>Fiction, Romance, Contemporary</genre>
## <pages>320</pages>
## <type>hardcover</type>
## <ratings>3.99</ratings>
## </Book>
## <Book>
## <title>The Relic</title>
## <author>Douglas Preston, Lincoln Child</author>
## <genre>Horror, Thriller, Mystery</genre>
## <pages>480</pages>
## <type>softcover</type>
## <ratings>4.05</ratings>
## </Book>
## <Book>
## <title>The Glass Ocean</title>
## <author>Beatriz Williams, Lauren Willis, Karen White</author>
## <genre>Historical Fiction, Romance, Mystery</genre>
## <pages>624</pages>
## <type>softcover</type>
## <ratings>3.88</ratings>
## </Book>
## </Books>
##
#transform xml file to dataframe
booklistxml_df <- xmlToDataFrame(booklistxml)
booklistxml_df
## title
## 1 Things I Wish I Told My Mother
## 2 The Relic
## 3 The Glass Ocean
## author
## 1 Susan Patterson, Susan Dilallo, James Patterson
## 2 Douglas Preston, Lincoln Child
## 3 Beatriz Williams, Lauren Willis, Karen White
## genre pages type ratings
## 1 Fiction, Romance, Contemporary 320 hardcover 3.99
## 2 Horror, Thriller, Mystery 480 softcover 4.05
## 3 Historical Fiction, Romance, Mystery 624 softcover 3.88
class(booklistxml_df$ratings)
## [1] "character"
#ratings column came in as chr type; need to change to numeric
booklistxml_df <- transform(booklistxml_df, ratings = as.numeric(ratings))
booklistxml_df
## title
## 1 Things I Wish I Told My Mother
## 2 The Relic
## 3 The Glass Ocean
## author
## 1 Susan Patterson, Susan Dilallo, James Patterson
## 2 Douglas Preston, Lincoln Child
## 3 Beatriz Williams, Lauren Willis, Karen White
## genre pages type ratings
## 1 Fiction, Romance, Contemporary 320 hardcover 3.99
## 2 Horror, Thriller, Mystery 480 softcover 4.05
## 3 Historical Fiction, Romance, Mystery 624 softcover 3.88
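In the output above `ratings` came back as character; `xmlToDataFrame()` reads node text as strings unless `colClasses` is supplied. A minimal sketch with an inline document (two of the records shown above, trimmed to three fields) makes this visible and converts both numeric-looking columns in one step. Parsing from a string with `asText = TRUE` is a choice made here to keep the example offline.

```r
library(XML)

# Two records from the output above, trimmed to three fields.
xml_text <- "<Books>
  <Book><title>The Relic</title><pages>480</pages><ratings>4.05</ratings></Book>
  <Book><title>The Glass Ocean</title><pages>624</pages><ratings>3.88</ratings></Book>
</Books>"

doc <- xmlParse(xml_text, asText = TRUE)
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column starts out as character, including pages and ratings.
classes_before <- sapply(books, class)
print(classes_before)

# Convert the numeric-looking columns in one step.
books <- transform(books,
                   pages = as.integer(pages),
                   ratings = as.numeric(ratings))
str(books)
```

Converting `pages` as well as `ratings` matters for the later comparison, since a character column never matches an integer one.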
##Are the three data frames identical?
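`identical()` returns a single TRUE or FALSE and does not say where two frames diverge. In the outputs above, the JSON frame splits authors into `author_1` to `author_3` and names the page count `num_pgs`, while the HTML and XML frames use a single `author` column with `num_pages` and `pages` respectively; the JSON file also records `paperback` where the others say `softcover`. As a rough sketch, the snippet below builds two small stand-in frames (hypothetical, trimmed to two rows and a few columns) and compares names before values.

```r
# Stand-in frames (hypothetical, trimmed) mirroring the JSON and XML layouts.
json_like <- data.frame(
  title    = c("The Relic", "The Glass Ocean"),
  author_1 = c("Douglas Preston", "Beatriz Williams"),
  num_pgs  = c(480L, 624L),
  ratings  = c(4.05, 3.88),
  stringsAsFactors = FALSE
)
xml_like <- data.frame(
  title   = c("The Relic", "The Glass Ocean"),
  author  = c("Douglas Preston, Lincoln Child",
              "Beatriz Williams, Lauren Willis, Karen White"),
  pages   = c(480L, 624L),
  ratings = c(4.05, 3.88),
  stringsAsFactors = FALSE
)

# Columns that exist in only one of the two frames.
only_json <- setdiff(names(json_like), names(xml_like))
only_xml  <- setdiff(names(xml_like), names(json_like))
print(only_json)
print(only_xml)

# identical() is FALSE as soon as names, types, or values differ anywhere.
frames_identical <- identical(json_like, xml_like)
print(frames_identical)

# Restricting both frames to the shared columns isolates the structural gap.
shared <- intersect(names(json_like), names(xml_like))
shared_equal <- all.equal(json_like[shared], xml_like[shared])
print(shared_equal)
```

Harmonizing the column names and the author layout across the three source files would be a prerequisite for the frames to compare as identical.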
# Checking if the dataframes are identical
identical(booklistjson, booklistxml_df)
## [1] FALSE
identical(booklisthtml_df, booklistxml_df)
## [1] FALSE
identical(booklistjson, booklisthtml_df)
## [1] FALSE
print("No, the three data frames are not identical to each other.")
## [1] "No, the three data frames are not identical to each other."