1. Setting Up Libraries

    I will mainly be using tidyr and dplyr here to do the analysis.
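
    The functions called in the sections below come from other packages as well: getURL is in RCurl, xmlToDataFrame and readHTMLTable are in XML, and fromJSON is in jsonlite. A setup chunk along these lines is a sketch of what is probably needed, assuming all five packages are already installed:

```r
# Packages for the calls that appear later in this document
library(RCurl)     # getURL()
library(XML)       # xmlToDataFrame(), readHTMLTable()
library(jsonlite)  # fromJSON()

# Packages named in the text above for the analysis itself
library(tidyr)
library(dplyr)
```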

  2. Goal

    Pull in book information for three books, each in a different format: XML, JSON, and HTML. I will put the files on my GitHub and read them in using getURL.

  3. The XML

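# Download the raw XML text from GitHub (getURL is from RCurl)
my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/Data607/main/LAB6/books.xml")
# Convert each top-level record into one data frame row (xmlToDataFrame is from XML)
booksXML <- xmlToDataFrame(my_git_url)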
booksXML
  4. The JSON

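# Download the raw JSON text from GitHub
my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/Data607/main/LAB6/books.json")
# Parse the JSON text; an array of records is simplified to a data frame
booksJSON <- jsonlite::fromJSON(my_git_url)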
booksJSON
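
When a JSON field holds an array (for example, several authors for one book), fromJSON typically returns that column as a list column rather than a plain character vector. The following is a minimal sketch of collapsing such a column into one string per book. The inline JSON and its field names (title, authors, year) are made up for illustration and are not the contents of books.json.

```r
library(jsonlite)

# Hypothetical stand-in for books.json; the real field names may differ.
sample_json <- '[
  {"title": "Book A", "authors": ["Author One", "Author Two"], "year": 2016},
  {"title": "Book B", "authors": ["Author Three"], "year": 2011}
]'

books <- fromJSON(sample_json)

# "authors" is a list column here: one character vector per book.
# Join each vector into a single comma-separated string.
books$authors <- vapply(books$authors, paste, character(1), collapse = ", ")
books
```

After this step the authors column is an ordinary character column, which makes it easier to compare with the data frames built from the other formats.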
  5. The HTML

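# Download the raw HTML text from GitHub
my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/Data607/main/LAB6/books.html")
# Read the first <table> on the page into a data frame (readHTMLTable is from XML)
booksHTML <- readHTMLTable(my_git_url, which=1)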
booksHTML

My attempt at the XML file produced errors when a book had multiple authors; I couldn’t figure out how to get around the errors caused by multiple nested children. After I removed the second author of “Never Split the Difference”, the three data frames look similar, although not perfectly identical. The “Author” column was not parsed properly from the array in the JSON file.
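
One possible way around the nested-children problem in the XML, offered as a sketch and not a tested fix for books.xml, is to skip xmlToDataFrame and build each row by hand with XPath, joining however many author nodes a book has into one string. The inline XML and its tag names (book, title, authors, author, year) are assumptions for illustration; the real file may be structured differently.

```r
library(XML)

# Hypothetical stand-in for books.xml; the real tag names may differ.
sample_xml <- '<books>
  <book>
    <title>Book A</title>
    <authors><author>Author One</author><author>Author Two</author></authors>
    <year>2016</year>
  </book>
  <book>
    <title>Book B</title>
    <authors><author>Author Three</author></authors>
    <year>2011</year>
  </book>
</books>'

doc <- xmlParse(sample_xml, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Build one row per <book>, joining any number of <author> children into one string.
rows <- lapply(book_nodes, function(node) {
  data.frame(
    title   = xpathSApply(node, "./title", xmlValue),
    authors = paste(xpathSApply(node, ".//author", xmlValue), collapse = ", "),
    year    = xpathSApply(node, "./year", xmlValue),
    stringsAsFactors = FALSE
  )
})
booksXML_flat <- do.call(rbind, rows)
booksXML_flat
```

Because every book ends up with exactly one authors value, the second author does not have to be removed from the source file for the rows to line up.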