DATA607: HTML, JSON, and XML

Introduction

Importing data from a variety of formats is an essential skill in R. In this assignment, I create a simple table of my favorite statistics textbooks in three formats (HTML, JSON, and XML). Then I import each format into R and store it as a dataframe.

I used the packages below.

library(tidyverse)
library(XML)
library(xml2)
library(jsonlite)
library(rvest)

Identical data is stored in three different formats on GitHub.

xml_src <- read_xml("https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/Wk7/StatsBooks.xml")
html_src <- read_html("https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/Wk7/StatsBooks.html")
json_src <- fromJSON("https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/Wk7/StatsBooks.json")

Next, I converted each imported object to an R dataframe.

json_df <- as.data.frame(json_src)

html_df <- html_src %>%
  html_table() %>%
  as.data.frame()

xml_df <- xml_src %>%
  xmlParse() %>%
  xmlToDataFrame()

Let’s examine the dataframes.

html_df

##                                       Title                         Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2                      OpenIntro Statistics    Diez, Cetinkaya-Rundel, Barr
## 3                     Stats Data and Models        De Veaux, Velleman, Bock
##   Length         ISBN Color
## 1    853 5.343774e+08    No
## 2    422 9.781943e+12    No
## 3    939 9.780322e+12   Yes

Before transforming any of the data, it’s worth noting that html_table() parsed the ISBN entries as numerics rather than strings. As a result, R dropped the leading zero from the first ISBN (0534377416 became 534377416, displayed as 5.343774e+08). Let’s convert this column to character and restore the missing zero.

html_df$ISBN <- as.character(html_df$ISBN)
html_df[1,4] <- "0534377416"
html_df

##                                       Title                         Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2                      OpenIntro Statistics    Diez, Cetinkaya-Rundel, Barr
## 3                     Stats Data and Models        De Veaux, Velleman, Bock
##   Length          ISBN Color
## 1    853    0534377416    No
## 2    422 9781943450077    No
## 3    939 9780321986498   Yes
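Rather than patching the value by hand, the type guessing can be switched off at import. The sketch below is only an illustration: it assumes rvest 1.0.0 or later (where html_table() has a convert argument and html_element() exists), and sample_html is a hypothetical inline stand-in for StatsBooks.html, reduced to two columns.

```r
library(rvest)

# Hypothetical stand-in for StatsBooks.html (two columns, two rows only).
sample_html <- '<table>
  <tr><th>Title</th><th>ISBN</th></tr>
  <tr><td>Mathematical Statistics With Applications</td><td>0534377416</td></tr>
  <tr><td>OpenIntro Statistics</td><td>9781943450077</td></tr>
</table>'

books <- read_html(sample_html) %>%
  html_element("table") %>%
  html_table(convert = FALSE)  # assumption: rvest >= 1.0.0; keeps every cell as character

# The leading zero survives because ISBN was never parsed as a number.
books$ISBN
```

With convert = FALSE every column comes back as character, so numeric columns such as Length would need an explicit as.integer() afterwards.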

The dataframe derived from the JSON source appears below.

json_df

##                                 Books.Title                   Books.Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2                      OpenIntro Statistics    Diez, Cetinkaya-Rundel, Barr
## 3                     Stats Data and Models        De Veaux, Velleman, Bock
##   Books.Length    Books.ISBN Books.Color
## 1          853    0534377416          No
## 2          422 9781943450077          No
## 3          939 9780321986498         Yes

Here we see that each column name carries the prefix "Books.". The JSON file nests the table under a single top-level key, Books, so fromJSON() returns a list containing one dataframe, and as.data.frame() prepends that key to every column name. Let’s change the column names so that they are consistent with those from the HTML table.

json_df <- rename(json_df, "Title" = "Books.Title",
       "Authors" = "Books.Authors",
       "Length" = "Books.Length",
       "ISBN" = "Books.ISBN",
       "Color" = "Books.Color")
json_df

##                                       Title                         Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2                      OpenIntro Statistics    Diez, Cetinkaya-Rundel, Barr
## 3                     Stats Data and Models        De Veaux, Velleman, Bock
##   Length          ISBN Color
## 1    853    0534377416    No
## 2    422 9781943450077    No
## 3    939 9780321986498   Yes
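An alternative to renaming is to pull the nested dataframe out of the parsed list directly. The sketch below assumes the JSON file is an object with one key, Books, holding an array of records; sample_json is a hypothetical inline stand-in for StatsBooks.json, reduced to two fields.

```r
library(jsonlite)

# Hypothetical stand-in for StatsBooks.json (assumed structure: {"Books": [ ... ]}).
sample_json <- '{"Books": [
  {"Title": "OpenIntro Statistics",  "ISBN": "9781943450077"},
  {"Title": "Stats Data and Models", "ISBN": "9780321986498"}
]}'

parsed <- fromJSON(sample_json)   # a list with one element, Books

# Converting the whole list prefixes each column with the list element's name.
prefixed_names <- names(as.data.frame(parsed))

# Extracting the element itself keeps the original column names.
books <- parsed$Books
names(books)
```

Either route gives the same data; extracting the element simply avoids having to list every column in rename().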

The dataframe derived from the XML source appears below.

xml_df

##                                       Title                         Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2                      OpenIntro Statistics    Diez, Cetinkaya-Rundel, Barr
## 3                     Stats Data and Models        De Veaux, Velleman, Bock
##   Length          ISBN Color
## 1    853    0534377416    No
## 2    422 9781943450077    No
## 3    939 9780321986498   Yes

This dataframe requires no transformation to be consistent with the other two.
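The XML import keeps the ISBN intact because xmlToDataFrame() reads every element as text. The sketch below illustrates this under assumptions: sample_xml is a hypothetical inline stand-in for StatsBooks.xml, and the element names Books and Book are guesses at its layout.

```r
library(XML)

# Hypothetical stand-in for StatsBooks.xml; the Books/Book nesting is assumed.
sample_xml <- '<Books>
  <Book><Title>Mathematical Statistics With Applications</Title><Length>853</Length><ISBN>0534377416</ISBN></Book>
  <Book><Title>OpenIntro Statistics</Title><Length>422</Length><ISBN>9781943450077</ISBN></Book>
</Books>'

doc   <- xmlParse(sample_xml, asText = TRUE)
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column arrives as character, so the leading zero in the ISBN is preserved.
sapply(books, class)

# Numeric columns need an explicit conversion if they are to be used as numbers.
books$Length <- as.integer(books$Length)
```

The trade-off is the reverse of the HTML case: nothing is lost on import, but numeric columns such as Length have to be converted by hand.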

Daniel Moscoe

3/16/2021
