Assignment 5-607

For Assignment 5, we were asked to create HTML, XML, and JSON files listing our favorite books along with some information about each one. I chose Beautiful Creatures (which I have never read), East of Eden (I love this book), and The Catcher in the Rye (an interesting read).

To read the HTML data into R, I experimented a bit and even tried web scraping. I ended up preferring the result of the htmltab() function, which comes from the htmltab library.

HTML

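library(htmltab)   # provides htmltab()
library(magrittr)  # provides the %>% pipe

# Earlier attempt, kept for reference:
#rawHTML <- read_html("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.html")
#class(rawHTML)
#as.data.frame(rawHTML)

# Read the first table on the page into a data frame
html <- htmltab("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.html", which = 1) %>% as.data.frame()
print(html)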

##   ID                  Title                         Author Year of Publish
## 2  1    Beautiful Creatures Kami Garcia and Margaret Stohl            2009
## 3  2           East of Eden                 John Steinbeck            2003
## 4  3 The Catcher in the Rye                  J.D. Salinger            1991
##                               Publisher               Genre
## 2 Little, Brown Books for Young Readers             Fantasy
## 3                                Viking             Fiction
## 4             Little, Brown and Company Young adult fiction
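The commented-out read_html() attempt above stops at the parsed page, which as.data.frame() cannot convert directly. As a rough sketch only, and not the method used in this assignment, the rvest package can pull a table out of a parsed page with html_element() and html_table(). The inline page below is a hypothetical stand-in for books.html with only three of the columns.

```r
library(rvest)

# Hypothetical inline page standing in for books.html (assumed structure)
page <- minimal_html('
  <table>
    <tr><th>ID</th><th>Title</th><th>Author</th></tr>
    <tr><td>1</td><td>Beautiful Creatures</td><td>Kami Garcia and Margaret Stohl</td></tr>
    <tr><td>2</td><td>East of Eden</td><td>John Steinbeck</td></tr>
    <tr><td>3</td><td>The Catcher in the Rye</td><td>J.D. Salinger</td></tr>
  </table>')

# Select the first <table> node and convert it to a data frame
first_table <- html_element(page, "table")
books_df <- as.data.frame(html_table(first_table))
books_df
```

With a real page, read_html(url) would take the place of minimal_html().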

Next, to read in the XML file, I first applied read_xml(), then xmlParse(), and finally converted the result to a data frame with xmlToDataFrame().

XML

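library(xml2)  # provides read_xml()
library(XML)   # provides xmlParse() and xmlToDataFrame()

xml_url <- "https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.xml"
xml_doc <- read_xml(xml_url)          # download and read the XML
xml_parsed <- xmlParse(xml_doc)       # parse it with the XML package
xml_df <- xmlToDataFrame(xml_parsed)  # one row per book
class(xml_df)

## [1] "data.frame"

xml_df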

##                    title                         author publish_year
## 1    Beautiful Creatures Kami Garcia and Margaret Stohl         2009
## 2           East of Eden                 John Steinbeck         2003
## 3 The Catcher in the Rye                  J.D. Salinger         1991
##                               publisher               genre
## 1 Little, Brown Books for Young Readers             Fantasy
## 2                                Viking             Fiction
## 3             Little, Brown and Company Young adult fiction
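For comparison, here is a minimal sketch of building a similar table with xml2 alone, without the xmlParse() and xmlToDataFrame() steps. It is an alternative, not the approach used above. The inline XML and the element names books and book are assumptions, since the actual file is not shown; the field names come from the column names in the output.

```r
library(xml2)

# Hypothetical inline XML; <books> and <book> are assumed element names
xml_src <- '<books>
  <book><title>Beautiful Creatures</title><author>Kami Garcia and Margaret Stohl</author><publish_year>2009</publish_year></book>
  <book><title>East of Eden</title><author>John Steinbeck</author><publish_year>2003</publish_year></book>
  <book><title>The Catcher in the Rye</title><author>J.D. Salinger</author><publish_year>1991</publish_year></book>
</books>'

doc <- read_xml(xml_src)
book_nodes <- xml_find_all(doc, "//book")

# Helper: text of one child element, taken from every <book> node
field <- function(name) xml_text(xml_find_first(book_nodes, name))

books_df <- data.frame(
  title = field("title"),
  author = field("author"),
  publish_year = as.integer(field("publish_year")),
  stringsAsFactors = FALSE
)
books_df
```

Extracting each field explicitly takes more typing, but it makes the column types (here, an integer year) a deliberate choice.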

The commented-out code below was my first attempt at reading in the JSON data. It is wrong, and I did not like the results it produced, but I am keeping it because it helps me learn from the mistake.

#json_url <- ("https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.json")
#json_url
#json_file <- fromJSON(file=json_url)
#json_df<- as.data.frame(json_file)
#json_df

After reading through many sources and blogs and experimenting with different ways of reading in a JSON file, I settled on the approach below and really liked the results. I am also happy with it because I finally got to use the lapply() function: I had experimented with it in previous assignments, but this is the first time it made it into my final code.

JSON

library(rjson)  # assumed package: fromJSON(file = ...) matches rjson's interface

json_url <- "https://raw.githubusercontent.com/Sangeetha-007/R-Practice/master/607/Assignments/Assignment%207/books.json"
json_df <- fromJSON(file = json_url)

# Loop through each "book"
df <- lapply(json_df, function(book) {
  # Convert each group to a df.
  # This assumes 6 elements each time
  data.frame(matrix(unlist(book), ncol = 6, byrow = TRUE))
})

# Now you have a list of dfs, connect them together in
# one single df
df <- do.call(rbind, df)

# Make column names nicer, remove row names
colnames(df) <- names(json_df[[1]][[1]])
rownames(df) <- NULL
class(df)

## [1] "data.frame"

df

##                                                               
## 1 1    Beautiful Creatures Kami Garcia and Margaret Stohl 2009
## 2 2           East of Eden                 John Steinbeck 2003
## 3 3 The Catcher in the Rye                  J.D. Salinger 1991
##                                                            
## 1 Little, Brown Books for Young Readers             Fantasy
## 2                                Viking             Fiction
## 3             Little, Brown and Company Young adult fiction
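The blank column headers in the output above suggest that names(json_df[[1]][[1]]) returned NULL. One possible explanation, which is only an assumption because the JSON file itself is not shown, is that the parsed result is a plain list of books, so json_df[[1]][[1]] is the first field's value, not the first book. The base R sketch below uses a hypothetical in-memory list with assumed key names to show the difference, and one way to keep the names when building the data frame.

```r
# Hypothetical stand-in for the parsed JSON: a list of books,
# each book a named list of fields (key names are assumed)
parsed <- list(
  list(id = 1, title = "Beautiful Creatures",
       author = "Kami Garcia and Margaret Stohl", publish_year = 2009,
       publisher = "Little, Brown Books for Young Readers", genre = "Fantasy"),
  list(id = 2, title = "East of Eden",
       author = "John Steinbeck", publish_year = 2003,
       publisher = "Viking", genre = "Fiction"),
  list(id = 3, title = "The Catcher in the Rye",
       author = "J.D. Salinger", publish_year = 1991,
       publisher = "Little, Brown and Company", genre = "Young adult fiction")
)

names(parsed[[1]][[1]])  # NULL: the first field's value carries no names
names(parsed[[1]])       # the field names of the first book

# Convert each named list to a one-row data frame, then stack the rows
book_rows <- lapply(parsed, function(book) {
  as.data.frame(book, stringsAsFactors = FALSE)
})
books_df <- do.call(rbind, book_rows)
rownames(books_df) <- NULL
books_df
```

Because as.data.frame() keeps the names of each list, no separate colnames() step is needed in this version.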

Overall, I liked this assignment because I learned different ways of reading data from different file formats into R. The final data frames created from the three files are very similar, except that the one built from the JSON file has no column names. While creating the data, my personal favorite was the JSON file, possibly because its format was the most distinctive. I am curious to know which of the three formats is used most in industry.
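To check that the data frames really hold the same information, the columns can be put under common names and compared. This is a small base R sketch using hypothetical miniature versions of the three results (two shared columns only); V1 to V3 are placeholder names standing in for the unnamed JSON columns.

```r
titles  <- c("Beautiful Creatures", "East of Eden", "The Catcher in the Rye")
authors <- c("Kami Garcia and Margaret Stohl", "John Steinbeck", "J.D. Salinger")

# Hypothetical miniature versions of the three data frames
html_df <- data.frame(ID = c("1", "2", "3"), Title = titles, Author = authors,
                      stringsAsFactors = FALSE)
rownames(html_df) <- 2:4  # the HTML result above is numbered 2 to 4
xml_df  <- data.frame(title = titles, author = authors, stringsAsFactors = FALSE)
json_df <- data.frame(V1 = c("1", "2", "3"), V2 = titles, V3 = authors,
                      stringsAsFactors = FALSE)

# Keep the shared columns, give them common names, reset row names
common <- c("title", "author")
html_cmp <- setNames(html_df[, c("Title", "Author")], common)
rownames(html_cmp) <- NULL
json_cmp <- setNames(json_df[, 2:3], common)

isTRUE(all.equal(html_cmp, xml_df))  # do HTML and XML agree?
isTRUE(all.equal(json_cmp, xml_df))  # do JSON and XML agree?
```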

Sources: https://cran.r-project.org/web/packages/htmltab/vignettes/htmltab.html, https://www.tutorialspoint.com/r/r_xml_files.htm, https://www.r-bloggers.com/2015/05/from-json-to-tables/

Sangeetha Sasikumar

10/15/2022