Instructions

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
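
For the closing question about whether the data frames are identical, one possible check is sketched here. The two small data frames are hypothetical stand-ins, not the ones loaded from the files, and normalize_books is an illustrative helper rather than a function from any package. The sketch assumes the columns appear in the same order in every source.

```r
# Hypothetical stand-ins for data frames loaded from two different formats
books_a <- data.frame(
  Title = c("Les Mis\u00e9rables", "Player Piano"),
  Author = c("Victor Hugo", "Kurt Vonnegut"),
  Year.Published = c(1862L, 1952L),
  stringsAsFactors = FALSE
)
books_b <- data.frame(
  title = c("Les Mis\u00e9rables", "Player Piano"),
  author = c("Victor  Hugo", "Kurt Vonnegut"),
  year_published = c("1862", "1952"),
  stringsAsFactors = FALSE
)

# Illustrative helper: same names, all character columns, tidy whitespace
normalize_books <- function(df) {
  names(df) <- c("title", "author", "year_published")
  df[] <- lapply(df, function(col) trimws(gsub("\\s+", " ", as.character(col))))
  rownames(df) <- NULL
  df
}

# Column names and types differ, so the raw comparison fails
print(identical(books_a, books_b))
# After normalizing, only the cell contents are compared
print(identical(normalize_books(books_a), normalize_books(books_b)))
```

Comparing before and after normalization separates differences in names and types from differences in the actual values.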

HTML

library(rvest)

## Warning: package 'rvest' was built under R version 3.6.3

## Loading required package: xml2

html_txt <- "https://raw.githubusercontent.com/hillt5/DATA607_assignment_3_15_20/master/html_txt"

Convert to Data Frame

html_df <- as.data.frame(html_table(read_html(html_txt)))
html_df

##                                     Title                               Author
## 1 Frankenstein; or, The Modern Prometheus Mary Shelley, \n    Percy B. Shelley
## 2                          Les Misérables                          Victor Hugo
## 3                            Player Piano                        Kurt Vonnegut
##                Genre Year.Published
## 1             horror           1818
## 2 historical fiction           1862
## 3    science fiction           1952

Of note, the line break I placed between the two authors in the HTML source did not carry over as a break; it appears in the data frame as ‘\n’ followed by extra spaces. I also used an accented character in the second book’s title to see whether it would be converted, and it was not.

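library(stringr)
# Try to strip the line break between the two authors (row 1, Author column)
html_authors <- str_replace(html_df[1,2], "\\r\\n", "" )
html_df[1,2] <- html_authors
# Try to replace the mis-encoded accent in the second title (row 2, Title column)
html_accent <- str_replace(html_df[2,1], "Ã©", "e")
html_df[2,1] <- html_accent
html_df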

##                                     Title                               Author
## 1 Frankenstein; or, The Modern Prometheus Mary Shelley, \n    Percy B. Shelley
## 2                          Les Misérables                          Victor Hugo
## 3                            Player Piano                        Kurt Vonnegut
##                Genre Year.Published
## 1             horror           1818
## 2 historical fiction           1862
## 3    science fiction           1952
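
The output above still shows the ‘\n’, which suggests that the pattern "\\r\\n" (a carriage return followed by a newline) never matched the bare newline and spaces in the cell. One way to collapse that whitespace is sketched here in base R, using a hypothetical copy of the cell value rather than html_df itself:

```r
# Hypothetical copy of the Author cell as printed above
author_raw <- "Mary Shelley, \n    Percy B. Shelley"

# The carriage-return/newline pair does not occur in this string
has_crlf <- grepl("\r\n", author_raw, fixed = TRUE)

# Collapse every run of whitespace (newline included) into one space
author_clean <- trimws(gsub("\\s+", " ", author_raw))

print(has_crlf)
print(author_clean)
```

stringr::str_squish() performs the same collapse-and-trim in a single call, which may fit better alongside the stringr code already in use.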

XML

library(XML)

## 
## Attaching package: 'XML'

## The following object is masked from 'package:rvest':
## 
##     xml

xml_txt <- "C:/Users/Thomas/Documents/GitHub/DATA607_assignment_3_15_20/xml_txt"

I was unable to read the XML file directly from its GitHub URL, so the code above points to a local copy. The file is available for download from my repository for this homework.

xml_df <- xmlToDataFrame(xml_txt, homogeneous = TRUE, stringsAsFactors = FALSE)
xml_df

##                                     title        author              genre
## 1 Frankenstein; or, The Modern Prometheus  Mary Shelley   Percy B. Shelley
## 2                          Les Misérables   Victor Hugo historical fiction
## 3                            Player Piano Kurt Vonnegut    science fiction
##   year_published   NA
## 1         horror 1818
## 2           1862 <NA>
## 3           1952 <NA>

Using xmlToDataFrame, I kept getting the same error about the duplicate author elements. The only way I found to avoid this error was the argument homogeneous = TRUE, which creates another issue: the second author shifts the remaining columns one position to the right for this particular book.

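# Combine the two author cells of the first book into one string
xml_two_authors <- paste(xml_df[[1,2]], ",", xml_df[[1,3]])
xml_df[[1,2]] <- xml_two_authors
# Shift the remaining values of row 1 left by one column
xml_df[[1,3]] <- xml_df[[1,4]]
xml_df[[1,4]] <- xml_df[[1,5]]
# Drop the now-redundant fifth column
xml_df <- xml_df[-5]
xml_df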

##                                     title                          author
## 1 Frankenstein; or, The Modern Prometheus Mary Shelley , Percy B. Shelley
## 2                          Les Misérables                     Victor Hugo
## 3                            Player Piano                   Kurt Vonnegut
##                genre year_published
## 1             horror           1818
## 2 historical fiction           1862
## 3    science fiction           1952

I ended up manipulating individual data frame cells to produce the appropriate columns. I pasted the two author cells together, shifted the contents of the other cells in the affected row left by one column, and finally deleted the rightmost column. There are undoubtedly other solutions that would scale better, for example identifying the XML nodes directly with XPath expressions.
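
The XPath idea mentioned above could look roughly like the following sketch. It uses the xml2 package (already loaded by rvest) instead of XML. The xml_string value is a hypothetical stand-in, since the layout of the real file is not shown here; the sketch assumes each book is a book element with one or more author children. collect_field is an illustrative helper, not an xml2 function.

```r
library(xml2)

# Assumed layout; the real file may use different element names
xml_string <- "<books>
  <book>
    <title>Frankenstein; or, The Modern Prometheus</title>
    <author>Mary Shelley</author>
    <author>Percy B. Shelley</author>
    <genre>horror</genre>
    <year_published>1818</year_published>
  </book>
  <book>
    <title>Player Piano</title>
    <author>Kurt Vonnegut</author>
    <genre>science fiction</genre>
    <year_published>1952</year_published>
  </book>
</books>"

doc <- read_xml(xml_string)
books <- xml_find_all(doc, "//book")

# Collect one field per book; repeated elements are joined with ", "
collect_field <- function(field) {
  vapply(books, function(book) {
    paste(xml_text(xml_find_all(book, field)), collapse = ", ")
  }, character(1))
}

xml_books <- data.frame(
  title = collect_field("title"),
  author = collect_field("author"),
  genre = collect_field("genre"),
  year_published = as.integer(collect_field("year_published")),
  stringsAsFactors = FALSE
)
print(xml_books)
```

Because the authors are joined per book before the data frame is built, no columns shift and no cell-by-cell repair is needed. read_xml() also accepts a URL, which may be worth trying for the copy hosted on GitHub.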

JSON

json_txt <- "https://raw.githubusercontent.com/hillt5/DATA607_assignment_3_15_20/master/json_txt"

library(rjson)
json_df <- as.data.frame(fromJSON(file = json_txt))
json_df

##                                     title           author  genre
## 1 Frankenstein; or, The Modern Prometheus     Mary Shelley horror
## 2 Frankenstein; or, The Modern Prometheus Percy B. Shelley horror
##   year_published        title.1    author.1            genre.1 year_published.1
## 1           1818 Les Misérables Victor Hugo historical fiction             1862
## 2           1818 Les Misérables Victor Hugo historical fiction             1862
##        title.2      author.2         genre.2 year_published.2
## 1 Player Piano Kurt Vonnegut science fiction             1952
## 2 Player Piano Kurt Vonnegut science fiction             1952

My first thought was to import the file with an rjson function; however, the resulting data frame did not have the intended structure. The two rows are duplicate entries for the book with two authors, and the other two books are appended as extra columns to the right of those rows. To figure out the problem, I’ll start by looking at the list originally generated by fromJSON.
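
The shape of this data frame is consistent with how as.data.frame treats a list of records: each element is converted on its own and the pieces are bound side by side, with shorter pieces recycled to the longest length. A small sketch with a hypothetical stand-in list (not the parsed file) shows the same pattern:

```r
# Hypothetical stand-in for a parsed JSON array of two books
book_list <- list(
  list(title = "Book A", author = c("First Author", "Second Author")),
  list(title = "Book B", author = "Only Author")
)

# Inspect the nesting: two records, the first with a length-2 author vector
str(book_list)

# Records become column groups; single values are recycled across both rows
flat <- as.data.frame(book_list)
print(dim(flat))
print(names(flat))
```

The two-author record sets the row count, and every other record is stretched to match it and placed in its own group of columns.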

json_list <- fromJSON(file = json_txt)
# Join the two authors of the first book into one string
authors_json <- paste(json_list[[1]][[2]][[1]], ",", json_list[[1]][[2]][[2]])
json_list[[1]][[2]] <- authors_json
json_df2 <- as.data.frame(json_list)
json_df2

##                                     title                          author
## 1 Frankenstein; or, The Modern Prometheus Mary Shelley , Percy B. Shelley
##    genre year_published        title.1    author.1            genre.1
## 1 horror           1818 Les Misérables Victor Hugo historical fiction
##   year_published.1      title.2      author.2         genre.2 year_published.2
## 1             1862 Player Piano Kurt Vonnegut science fiction             1952

This manipulation corrected the authors, but the structure is still wrong: the result is a single row, with the other two books appended as extra columns. I’ll try to fix this with the rbind function.

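# Split the single row into one four-column slice per book
json_book1 <- json_df2[1:4]
json_book2 <- json_df2[5:8]
# Give every slice the same column names so rbind can stack them
names(json_book2) <- names(json_book1)
json_book3 <- json_df2[9:12]
names(json_book3) <- names(json_book1)
json_df3 <- rbind(json_book1, json_book2, json_book3)
json_df3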

##                                     title                          author
## 1 Frankenstein; or, The Modern Prometheus Mary Shelley , Percy B. Shelley
## 2                          Les Misérables                     Victor Hugo
## 3                            Player Piano                   Kurt Vonnegut
##                genre year_published
## 1             horror           1818
## 2 historical fiction           1862
## 3    science fiction           1952

After renaming the columns of the second and third books to match those of the first, I was able to rbind the three books together into a single data frame.
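
The slicing above depends on knowing that there are exactly three books with four fields each. A possible variant, sketched here, builds one row per book directly from the parsed list, so the number of books does not matter. book_list is a hypothetical stand-in for the list returned by fromJSON, and book_to_row is an illustrative helper, not an rjson function.

```r
# Hypothetical stand-in for the list parsed from the JSON file
book_list <- list(
  list(title = "Frankenstein; or, The Modern Prometheus",
       author = c("Mary Shelley", "Percy B. Shelley"),
       genre = "horror", year_published = 1818),
  list(title = "Player Piano", author = "Kurt Vonnegut",
       genre = "science fiction", year_published = 1952)
)

# Illustrative helper: one book record -> one-row data frame
book_to_row <- function(book) {
  data.frame(
    title = book$title,
    author = paste(book$author, collapse = ", "),
    genre = book$genre,
    year_published = book$year_published,
    stringsAsFactors = FALSE
  )
}

# Stack the one-row data frames, however many books there are
json_books <- do.call(rbind, lapply(book_list, book_to_row))
print(json_books)
```

Collapsing the authors inside the helper means a book with several authors still yields exactly one row.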

Assignment – Working with XML and JSON in R

Thomas Hill

3/14/2020
