For Assignment 5, we were asked to create HTML, XML and JSON files with our favorite books and some information about it. I chose the books: Beautiful Creatures (I never read this one), East of Eden (I love this book) and The Catcher in the Rye (an interesting read).
To read in data from HTML into R, I did play around with it a bit. I even tried webscraping. I ended up liking the results/outcome of the htmltab() method, which was loaded in with the htmltab library. HTML
#rawHTML <- read_html("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.html")
#class(rawHTML)
#as.data.frame(rawHTML)
html <- htmltab("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.html", which =1) %>% as.data.frame()
print(html)
## ID Title Author Year of Publish
## 2 1 Beautiful Creatures Kami Garcia and Margaret Stohl 2009
## 3 2 East of Eden John Steinbeck 2003
## 4 3 The Catcher in the Rye J.D. Salinger 1991
## Publisher Genre
## 2 Little, Brown Books for Young Readers Fantasy
## 3 Viking Fiction
## 4 Little, Brown and Company Young adult fiction
Next, for the xml file to be read in, I had to first apply read_xml(), then xmlParse(), then convert that to a dataframe using xmlToDataFrame. XML
url <- ("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.xml")
url <-read_xml(url)
url<- xmlParse(url)
url <- xmlToDataFrame(url)
class(url)
## [1] "data.frame"
url
## title author publish_year
## 1 Beautiful Creatures Kami Garcia and Margaret Stohl 2009
## 2 East of Eden John Steinbeck 2003
## 3 The Catcher in the Rye J.D. Salinger 1991
## publisher genre
## 1 Little, Brown Books for Young Readers Fantasy
## 2 Viking Fiction
## 3 Little, Brown and Company Young adult fiction
This commented out code for reading in json data is wrong, but I keep it because it helps me learn from it. I did not like the results it created.
#json_url <- ("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.json")
#json_url
#json_file <- fromJSON(file=json_url)
#json_df<- as.data.frame(json_file)
#json_df
After reading through many sources and blogs and experimenting with different ways of reading in json file's data, I really liked the results I got from this. I am happy with it because I finally got to use the lapply() method as well. I have experimented with lapply() before for previous assignments, but finally was able to use it! JSON
json_url <- ("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.json")
json_df <- fromJSON(file=json_url)
df <- lapply(json_df, function(book) # Loop through each "book"
{
# Convert each group to a df.
# This assumes 6 elements each time
data.frame(matrix(unlist(book), ncol=6, byrow=T))
})
# Now you have a list of dfs, connect them together in
# one single df
df <- do.call(rbind, df)
# Make column names nicer, remove row names
colnames(df) <- names(json_df[[1]][[1]])
rownames(df) <- NULL
class(df)
## [1] "data.frame"
df
##
## 1 1 Beautiful Creatures Kami Garcia and Margaret Stohl 2009
## 2 2 East of Eden John Steinbeck 2003
## 3 3 The Catcher in the Rye J.D. Salinger 1991
##
## 1 Little, Brown Books for Young Readers Fantasy
## 2 Viking Fiction
## 3 Little, Brown and Company Young adult fiction
Overall, I liked this assignment because I learned different ways of reading in data for different file formats. The final dataframes created from the 3 files, are very similar, except for the json file where there wasn't a column title. While creating the data, my personal favorite was the JSON file, possibly because it was the most unique version. I am curious to know which one out of the three is used most in industry.
Sources: https://cran.r-project.org/web/packages/htmltab/vignettes/htmltab.html, https://www.tutorialspoint.com/r/r_xml_files.htm, https://www.r-bloggers.com/2015/05/from-json-to-tables/