Web Scraping

Setting Up Libraries

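I will mainly be using tidyr and dplyr for the analysis. The code below also relies on RCurl (getURL), XML (xmlToDataFrame, readHTMLTable), and jsonlite (fromJSON).

Goal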

Pull in information on three books stored in three different formats: XML, JSON, and HTML. I will put the files on my GitHub and read them in using getURL.

The XML

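library(RCurl)  # provides getURL
library(XML)    # provides xmlToDataFrame and readHTMLTable

# Download the raw XML text, then flatten it into a data frame
my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/Data607/main/LAB6/books.xml")
booksXML <- xmlToDataFrame(my_git_url)
booksXML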

The JSON

my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/Data607/main/LAB6/books.json")

booksJSON <- jsonlite::fromJSON(my_git_url)
booksJSON
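
When a JSON field holds an array, such as several authors for one book, fromJSON typically returns that field as a list column rather than a plain character column. The sketch below shows one possible way to flatten such a column. The sample JSON is invented for illustration and only assumes that books.json is an array of book objects with an array-valued "author" field; the real file may be shaped differently.

```r
library(jsonlite)

# Invented sample in the assumed shape: an array of book objects whose
# "author" field is itself an array of names.
sample_json <- '[
  {"title": "Never Split the Difference", "author": ["Chris Voss", "Tahl Raz"]},
  {"title": "Example Book", "author": ["Author One"]}
]'

books <- fromJSON(sample_json)

# If the authors came back as a list column, collapse each element
# into a single semicolon-separated string.
if (is.list(books$author)) {
  books$author <- vapply(books$author, paste, character(1), collapse = "; ")
}
books
```

Collapsing to one string keeps one row per book, which makes the result easier to compare with the XML and HTML data frames. Keeping one row per author would be an equally valid choice.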

The HTML

my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/Data607/main/LAB6/books.html")
booksHTML <- readHTMLTable(my_git_url, which=1)
booksHTML

The solution I attempted for the XML file resulted in errors when a book had multiple authors; I couldn’t figure out how to get around the errors caused by multiple nested children. After removing the second author for “Never Split the Difference”, the data frames look similar for the most part, although not perfectly identical. The “Author” column was not parsed properly from the array in the JSON file.
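
One possible way around the nested-children problem is to skip xmlToDataFrame and walk the book nodes with XPath, collapsing repeated author nodes into one string per book. This is only a sketch: the element names (books, book, title, authors, author) and the sample XML are assumptions made for illustration and may not match the layout of books.xml.

```r
library(XML)

# Invented sample: the tag names here are hypothetical, not taken from books.xml.
sample_xml <- '<books>
  <book>
    <title>Never Split the Difference</title>
    <authors>
      <author>Chris Voss</author>
      <author>Tahl Raz</author>
    </authors>
  </book>
  <book>
    <title>Example Book</title>
    <authors>
      <author>Author One</author>
    </authors>
  </book>
</books>'

doc <- xmlParse(sample_xml, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Build one row per book; every <author> under a book is joined into one field.
books_xml <- do.call(rbind, lapply(book_nodes, function(node) {
  data.frame(
    title  = xpathSApply(node, "./title", xmlValue),
    author = paste(xpathSApply(node, "./authors/author", xmlValue), collapse = "; "),
    stringsAsFactors = FALSE
  )
}))
books_xml
```

Extracting each field explicitly takes more code than xmlToDataFrame, but it makes the handling of repeated child nodes a deliberate choice instead of leaving it to the flattening step.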

Ahmed Elsaeyed

3/17/2022
