The purpose of this assignment is to practice loading HTML, XML, and JSON files into R. I wrote the same book data by hand in Sublime Text, saved it in each of the three formats, and uploaded the files to GitHub. From there, each file is loaded into R and converted into a data frame. The three data frames are then compared at the end.
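library(RCurl)    # loaded, but not called in the code below
library(XML)      # HTML and XML parsing
library(jsonlite) # JSON parsing
library(httr)     # HTTP requests (GET)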
HTML
HTMLURL <- "https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/Books.html"
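HTMLdata <- GET(HTMLURL)                                        # download the raw HTML
HTMLdata <- content(HTMLdata, as = "text", encoding = "UTF-8")  # extract the response body as a string
HTMLdata <- htmlParse(HTMLdata, asText = TRUE)                  # parse the HTML text
HTMLdata <- readHTMLTable(HTMLdata, stringsAsFactors = FALSE)   # returns a list of tables
# Converting the whole list prepends the list element name ("NULL") to each column name
BooksHTML_df <- as.data.frame(HTMLdata)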
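The as.data.frame call above is applied to the whole list of tables, which is where the NULL. prefix in the str output further down comes from. The following is a minimal sketch of that behaviour, not the actual import: the tables object and its two made-up books are assumptions standing in for the list that readHTMLTable returns.

```r
# Hypothetical stand-in for the list returned by readHTMLTable():
# a single table stored under the element name "NULL".
tables <- list(
  "NULL" = data.frame(
    title = c("Book A", "Book B"),
    pages = c("342", "160"),
    stringsAsFactors = FALSE
  )
)

# Converting the whole list prepends the element name to every column.
prefixed <- as.data.frame(tables)
names(prefixed)

# Extracting the first table instead keeps the original column names.
clean <- tables[[1]]
names(clean)
```

Pulling out the first list element is one way to avoid renaming the columns afterwards, assuming the page contains only the one table.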
XML
XMLurl <- "https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/Books.xml"
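XMLdata <- GET(XMLurl)                                        # download the raw XML
XMLdata <- content(XMLdata, as = "text", encoding = "UTF-8")  # extract the response body as a string
XMLdata <- xmlParse(XMLdata, asText = TRUE)                   # parse into an internal document
BooksXML_df <- xmlToDataFrame(XMLdata)                        # one row per child of the root node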
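In the str output further down, the two authors of the last book are run together in the XML data frame. One possible way around that is to collect the author values per book and join them with an explicit separator. This is only a sketch: the XML below is made up, and it assumes that each author sits in its own author element inside authors, which may not match the real Books.xml.

```r
library(XML)

# Hypothetical XML with nested <author> elements; the real file may differ.
xml_text <- paste0(
  "<books>",
  "<book><title>Book A</title>",
  "<authors><author>Ann Lee</author></authors></book>",
  "<book><title>Book B</title>",
  "<authors><author>Bo Chan</author><author>Cy Diaz</author></authors></book>",
  "</books>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# xmlToDataFrame() uses the text value of each field, so nested
# author names end up joined without a separator.
raw_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Alternative: gather the <author> values for each book and join them.
books <- getNodeSet(doc, "//book")
authors <- vapply(
  books,
  function(b) paste(xpathSApply(b, "./authors/author", xmlValue), collapse = ", "),
  character(1)
)

tidy_df <- raw_df
tidy_df$authors <- authors
tidy_df
```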
JSON
JSONurl <- "https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/books"
JSONdata <- fromJSON(JSONurl)
BooksJSON_df <- as.data.frame(JSONdata)
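When authors are stored as JSON arrays of different lengths, fromJSON keeps them as a list column rather than a character column. The sketch below shows one way to collapse such a column into plain strings. The JSON text is made up for illustration and is an assumption about how the real file is structured.

```r
library(jsonlite)

# Hypothetical JSON with authors stored as arrays; the real file may differ.
json_text <- '[
  {"title": "Book A", "authors": ["Ann Lee"]},
  {"title": "Book B", "authors": ["Bo Chan", "Cy Diaz"]}
]'
books <- fromJSON(json_text)

# Arrays of unequal length cannot be simplified to a vector,
# so the authors column comes back as a list.
was_list <- is.list(books$authors)

# Collapse each list element into one comma-separated string.
books$authors <- vapply(books$authors, paste, character(1), collapse = ", ")
str(books)
```

After this step the authors column has the same type as in the HTML and XML data frames, which makes the three easier to compare.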
str(BooksHTML_df)
## 'data.frame': 4 obs. of 6 variables:
## $ NULL.title : chr "Polio: An American Story" "The Argonauts" "Health Justice Now: Single Payer and What Comes Next" "Bodies and Barriers: Queer Activists on Health"
## $ NULL.authors: chr "David Oshinsky" "Maggie Nelson" "Timothy Faust" "Adrian Shanker, Kate Kendell"
## $ NULL.genre : chr "Non-fiction" "Autobiography" "Non-fiction" "Non-fiction"
## $ NULL.ISBN : chr "978-0195307146" "978-1555977351" "978-1612197166" "978-1629637846"
## $ NULL.year : chr "2005" "2016" "2019" "2020"
## $ NULL.pages : chr "342" "160" "272" "240"
str(BooksXML_df)
## 'data.frame': 4 obs. of 6 variables:
## $ title : chr "Polio: An American Story" "The Argonauts" "Health Justice Now: Single Payer and What Comes Next" "Bodies and Barriers: Queer Activists on Health"
## $ authors: chr "David Oshinsky" "Maggie Nelson" "Timothy Faust" "Adrian ShankerKate Kendell"
## $ genre : chr "Non-fiction" "Autobiography" "Non-fiction" "Non-fiction"
## $ ISBN : chr "978-0195307146" "978-1555977351" "978-1612197166" "978-1629637846"
## $ year : chr "2005" "2016" "2019" "2020"
## $ pages : chr "342" "160" "272" "240"
str(BooksJSON_df)
## 'data.frame': 4 obs. of 6 variables:
## $ title : chr "Polio: An American Story" "The Argonauts" "Health Justice Now: Single Payer and What Comes Next" "Bodies and Barriers: Queer Activists on Health"
## $ authors:List of 4
## ..$ : chr "Foster Provost"
## ..$ : chr "Maggie Nelson"
## ..$ : chr "Timothy Faust"
## ..$ : chr "Adrian Shanker" "Kate Kendell"
## $ genre : chr "Non-fiction" "Autobiography" "Non-fiction" "Non-fiction"
## $ ISBN : chr "978-0195307146" "978-1555977351" "978-1612197166" "978-1629637846"
## $ year : chr "2005" "2016" "2019" "2020"
## $ pages : chr "342" "160" "272" "240"
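In conclusion, the three data frames look broadly similar, although the import process differed for each format. The HTML and XML files were first downloaded with GET from httr and then parsed with functions from the XML package (RCurl is loaded but never called), whereas the JSON file could be read directly from its URL with the fromJSON function. The biggest difference is in the authors column: it is a character column in the HTML and XML data frames but a list column in the JSON data frame. The HTML and XML versions also handle the two-author book differently: the HTML string is "Adrian Shanker, Kate Kendell", while the XML string runs the names together as "Adrian ShankerKate Kendell". In addition, the HTML column names carry a "NULL." prefix, and the JSON output lists "Foster Provost" as the author of the first book where the HTML and XML list "David Oshinsky", which points to an inconsistency in the JSON source file rather than in the import. Each data frame therefore needs some tidying before further use: cleaning up the authors column, and converting year and pages, which are read as character in all three, to numeric.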
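The tidying and comparison described above could look roughly like the following. It is a sketch on made-up data, not the actual imports: the data frames a and b and the helper to_int are hypothetical names introduced only for this illustration.

```r
# Hypothetical stand-ins for two of the imported data frames.
a <- data.frame(
  title = c("Book A", "Book B"),
  year  = c("2005", "2016"),
  pages = c("342", "160"),
  stringsAsFactors = FALSE
)
b <- a
names(b) <- paste0("NULL.", names(b))  # mimic the prefixed HTML column names

# Strip the prefix so the column names line up.
names(b) <- sub("^NULL\\.", "", names(b))

# Convert the numeric-looking character columns to integer.
to_int <- function(df) {
  df$year  <- as.integer(df$year)
  df$pages <- as.integer(df$pages)
  df
}
a <- to_int(a)
b <- to_int(b)

# all.equal() describes any differences instead of returning only FALSE.
same <- isTRUE(all.equal(a, b))
same
```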