Importing data from a variety of formats is an essential skill in R. In this assignment, I created a small table of my favorite statistics textbooks in three formats (HTML, XML, and JSON). I then imported each file into R and stored it as a dataframe.
I used the packages below.
library(tidyverse)
library(XML)
library(xml2)
library(jsonlite)
library(rvest)
Identical data is stored in three different formats on GitHub.
xml_src <- read_xml("https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/Wk7/StatsBooks.xml")
html_src <- read_html("https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/Wk7/StatsBooks.html")
json_src <- fromJSON("https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/Wk7/StatsBooks.json")
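Next, I converted each imported object to an R dataframe.
# fromJSON() returned a list, so flatten it into a single dataframe
json_df <- as.data.frame(json_src)
# html_table() on a whole document returns a list of tables; this page has one
html_df <- html_src %>%
  html_table() %>%
  as.data.frame()
# Pass the xml2 document to the XML package's parser, then convert it
xml_df <- xml_src %>%
  xmlParse() %>%
  xmlToDataFrame()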
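The XML pipeline above mixes two packages: read_xml() comes from xml2, while xmlParse() and xmlToDataFrame() come from XML. As one possible alternative, here is a minimal sketch that stays within xml2. The inline document and its element names (Books, Book, Title, Length) are assumptions for illustration; the real StatsBooks.xml may be laid out differently.

```r
library(xml2)

# Hypothetical inline document; the real file may use different element names.
xml_string <- "<Books>
  <Book><Title>OpenIntro Statistics</Title><Length>422</Length></Book>
  <Book><Title>Stats Data and Models</Title><Length>939</Length></Book>
</Books>"

doc <- read_xml(xml_string)
books <- xml_find_all(doc, ".//Book")

# Build one single-row dataframe per <Book>; each child element becomes a column.
rows <- lapply(books, function(book) {
  fields <- xml_children(book)
  as.data.frame(as.list(setNames(xml_text(fields), xml_name(fields))),
                stringsAsFactors = FALSE)
})

# Stack the rows; every column is character at this point.
sketch_df <- do.call(rbind, rows)
sketch_df
```

Every value comes back as text with this approach, so numeric columns such as Length would still need an explicit as.integer() afterwards.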
Let’s examine the dataframes.
html_df
## Title Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2 OpenIntro Statistics Diez, Cetinkaya-Rundel, Barr
## 3 Stats Data and Models De Veaux, Velleman, Bock
## Length ISBN Color
## 1 853 5.343774e+08 No
## 2 422 9.781943e+12 No
## 3 939 9.780322e+12 Yes
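Before transforming any of the data, note that the ISBN column was read as numeric rather than as character. As a result, R dropped the leading zero from the first ISBN, 0534377416, which prints as 5.343774e+08. Let’s convert the column to character and restore the missing zero.
# Convert the numeric ISBNs to strings
html_df$ISBN <- as.character(html_df$ISBN)
# Restore the leading zero in row 1 by hand
html_df[1, "ISBN"] <- "0534377416"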
html_df
## Title Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2 OpenIntro Statistics Diez, Cetinkaya-Rundel, Barr
## 3 Stats Data and Models De Veaux, Velleman, Bock
## Length ISBN Color
## 1 853 0534377416 No
## 2 422 9781943450077 No
## 3 939 9780321986498 Yes
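Patching the ISBN by hand works for three rows but would not scale. One way to avoid the problem at import time is to stop html_table() from guessing column types. The sketch below assumes rvest 1.0.0 or later, where html_table() accepts a convert argument, and it uses a hypothetical one-row table rather than the real StatsBooks.html.

```r
library(rvest)

# Hypothetical one-row table standing in for StatsBooks.html.
page <- minimal_html("<table>
  <tr><th>Title</th><th>ISBN</th></tr>
  <tr><td>Mathematical Statistics With Applications</td><td>0534377416</td></tr>
</table>")

tbl <- html_element(page, "table")

converted <- html_table(tbl)                   # default: column types are guessed
kept      <- html_table(tbl, convert = FALSE)  # every cell stays character

class(converted$ISBN)  # a numeric type, so the leading zero is gone
kept$ISBN              # "0534377416"
```

With convert = FALSE every column stays character, including Length, so any column that really is numeric has to be converted explicitly afterwards.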
The dataframe derived from the JSON source appears below.
json_df
## Books.Title Books.Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2 OpenIntro Statistics Diez, Cetinkaya-Rundel, Barr
## 3 Stats Data and Models De Veaux, Velleman, Bock
## Books.Length Books.ISBN Books.Color
## 1 853 0534377416 No
## 2 422 9781943450077 No
## 3 939 9780321986498 Yes
Here each column name carries the prefix “Books.” (with a trailing period). JSON is hierarchical: judging by the prefix, the records sit under a single top-level key, Books, so fromJSON() returns a list with one element of that name, and as.data.frame() prepends the element name to every column. Let’s rename the columns so that they match those of the HTML table.
json_df <- rename(json_df,
                  "Title"   = "Books.Title",
                  "Authors" = "Books.Authors",
                  "Length"  = "Books.Length",
                  "ISBN"    = "Books.ISBN",
                  "Color"   = "Books.Color")
json_df
## Title Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2 OpenIntro Statistics Diez, Cetinkaya-Rundel, Barr
## 3 Stats Data and Models De Veaux, Velleman, Bock
## Length ISBN Color
## 1 853 0534377416 No
## 2 422 9781943450077 No
## 3 939 9780321986498 Yes
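The prefix can also be avoided without renaming anything, by pulling the nested element out of the parsed list before converting. This is a minimal sketch on a hypothetical JSON string; the single top-level key Books is inferred from the column prefix, and the real file's layout is an assumption.

```r
library(jsonlite)

# Hypothetical JSON with a single top-level key; the real file's layout is assumed.
json_string <- '{"Books": [
  {"Title": "OpenIntro Statistics", "ISBN": "9781943450077"},
  {"Title": "Stats Data and Models", "ISBN": "9780321986498"}
]}'

parsed <- fromJSON(json_string)

# Converting the whole list prefixes each column with the element name.
names(as.data.frame(parsed))  # "Books.Title" "Books.ISBN"

# Extracting the element first gives a dataframe with clean names.
names(parsed$Books)           # "Title" "ISBN"
```

Because the ISBNs are quoted in this JSON, they arrive as character and keep any leading zeros, which matches the output shown above.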
The dataframe derived from the XML source appears below.
xml_df
## Title Authors
## 1 Mathematical Statistics With Applications Wackerly, Mendenhall, Scheaffer
## 2 OpenIntro Statistics Diez, Cetinkaya-Rundel, Barr
## 3 Stats Data and Models De Veaux, Velleman, Bock
## Length ISBN Color
## 1 853 0534377416 No
## 2 422 9781943450077 No
## 3 939 9780321986498 Yes
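Its column names and printed values already match those of the other two dataframes, so no renaming or ISBN repair is needed. The printed output does not show column types, though, so those may still differ between sources.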
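Printed output can hide differences in column type, for example a Length stored as an integer in one dataframe and as text in another. The sketch below uses two hand-built stand-ins (from_html and from_xml are hypothetical, not the imported objects) to show one way to compare column classes and align them.

```r
# Hypothetical stand-ins: same printed values, different column types.
from_html <- data.frame(Title = "OpenIntro Statistics", Length = 422L,
                        stringsAsFactors = FALSE)
from_xml  <- data.frame(Title = "OpenIntro Statistics", Length = "422",
                        stringsAsFactors = FALSE)

# Column classes side by side: one row per column, one column per source.
classes <- cbind(html = sapply(from_html, class),
                 xml  = sapply(from_xml, class))
classes

same_before <- identical(from_html, from_xml)  # FALSE: Length differs in type

# Align the type, then compare again.
from_xml$Length <- as.integer(from_xml$Length)
same_after <- identical(from_html, from_xml)   # TRUE
```

Running the same sapply(df, class) call on html_df, json_df, and xml_df would show whether the three imports agree in type as well as in appearance.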
Understanding the format and structure of imported data is essential for working with it in R. The packages used here follow a common pattern: read the source, then convert the result to a dataframe. Even so, each format has its own quirks, such as type guessing in HTML tables and nested names in JSON.