Now that I’ve created my books table in html, JSON and XML formats, I’ll get them each into a dataframe to compare both the processes and results.
First, I need to pull in one library for each file format, as well as the tidyverse for dataframe manipulation as needed. I will also bring in RCurl so I can pull my files seamlessly from GitHub.
library(RCurl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(XML)
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
I’ll use the read_html function to pull in my file, and html_table to convert it into a dataframe. This function actually returns a list, which I need to index to retrieve the dataframe.
df_html <- getURL('https://raw.githubusercontent.com/LongSockSilver/week7_assignment/main/books.html') %>%
read_html() %>%
html_table() %>%
.[[1]]
The xmlToDataFrame function in the XML package can read in a table from XML cleanly. However, unlike with HTML, the column headers are in the tags rather than between them. That means I couldn’t use tag-forbidden characters like spaces. Therefore, I changed the column names to match my original intention after the fact.
df_xml <- getURL('https://raw.githubusercontent.com/LongSockSilver/week7_assignment/main/books.xml') %>%
xmlToDataFrame()
#Converting titles now that spaces can be present in column headers
colnames(df_xml)[3:4] <- c("Publication Year", "Price on Amazon (Hardcover, USD")
df_xml
## Book
## 1 Bad Blood: Secrets and Lies in a Silicon Valley Startup
## 2 An Ugly Truth: Inside Facebook's Battle for Domination
## 3 The Cult of We: WeWork, Adam Neumann, and the Great Startup Delusion
## Author Publication Year Price on Amazon (Hardcover, USD
## 1 John Carreyrou 2020 14.69
## 2 Sheera Frenkel, Cecilia Kang 2021 14.58
## 3 Eliot Brown, Maureen Farrell 2022 20.41
The fromJSON function in the jsonlite package makes reading in JSON files quite seamless. However, spaces in column headers are a small issue (they are replaced with periods). As with XML, I just had to change
df_json <- getURL('https://raw.githubusercontent.com/LongSockSilver/week7_assignment/main/books.json') %>%
fromJSON()
#Converting titles now that spaces can be present in column headers
colnames(df_json)[3:4] <- c("Publication Year", "Price on Amazon (Hardcover, USD)")
The tables all look quite similar. However, the JSON and HTML files had the requirement to reformat my headers, while the HTML did not.
Further, the XML file read in my two numerical columns (publication year and price) as characters, while my JSON import interpreted them as numbers. That’s not a very difficult fix, but it’s something to pay close attention to when reading in tables from different file formats.
In case there is trouble accessing the files directly from GitHub with RCurl, please download the full repository and access the files from your machine using the following code.
HTML:
df_html <- html_table(read_html('books.html'))[[1]]
XML:
df_xml <- xmlToDataFrame("books.xml")
#Converting titles now that spaces can be present in column headers
colnames(df_xml)[3:4] <- c("Publication Year", "Price on Amazon (Hardcover, USD")
JSON:
df_json <- fromJSON("books.json")
#Converting titles now that spaces can be present in column headers
colnames(df_json)[3:4] <- c("Publication Year", "Price on Amazon (Hardcover, USD)")