Overview

For this assignment, I created an HTML file, a JSON file, and an XML file, each containing information on three of my favorite books, and stored them in my GitHub repository. The task is to load this information into three separate R data frames.

HTML

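In this code block, I import the HTML table from my GitHub repository with “read_html”, extract it with the convenient “html_table” function (which returns a list containing one table per table element on the page), and use as.data.frame to turn that one-element list into a data frame in R.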

library(rvest)
library(knitr)

htmldata <- read_html("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.html")
htmldata <- html_table(htmldata)
htmldata <- as.data.frame(htmldata)

#These three commands can be done as a single command (shown below).

#htmldata <- as.data.frame(html_table(read_html("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.html")))

kable(htmldata, format = "pipe", caption = "Books Data Frame (HTML Data)", align = "lllccl")
Books Data Frame (HTML Data)

| Title | Subtitle | Authors | Pages | Year | Publisher |
|:---|:---|:---|:---:|:---:|:---|
| Driven to Distraction | Recognizing and Coping with Attention Deficit Disorder from Childhood through Adulthood | Edward M. Hallowell, John J. Ratey | 319 | 1994 | Pantheon Books |
| This is Your Brain on Music | The Science of a Human Obsession | Daniel J. Levitin | 322 | 2006 | Penguin Group |
| blink | The Power of Thinking Without Thinking | Malcolm Gladwell | 296 | 2007 | Little, Brown and Company |
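To make the three steps above more concrete, here is a minimal sketch that runs them on a small inline HTML table instead of the file on GitHub. The inline table and the names `page`, `all_tables`, `books`, and `books_direct` are stand-ins I made up for illustration, and I am assuming rvest 1.0 or later (for `minimal_html` and `html_element`).

```r
library(rvest)

# Hypothetical stand-in for the HTML file; the real file has six columns.
page <- minimal_html("
  <table>
    <tr><th>Title</th><th>Pages</th><th>Year</th></tr>
    <tr><td>blink</td><td>296</td><td>2007</td></tr>
    <tr><td>Driven to Distraction</td><td>319</td><td>1994</td></tr>
  </table>
")

# html_table() on a whole document returns a list with one tibble per <table>.
all_tables <- html_table(page)

# Pick the first (here, only) table explicitly and convert it to a data frame.
books <- as.data.frame(all_tables[[1]])

# Selecting the <table> node first gives a single tibble rather than a list.
books_direct <- as.data.frame(html_table(html_element(page, "table")))

str(books)
```

Indexing with `[[1]]` (or selecting the node first) would matter if a page held more than one table, since calling as.data.frame on a longer list would try to bind the tables side by side.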

JSON

In this code block, I import a JSON file from my GitHub repository using the jsonlite package, whose “fromJSON” function reads the file and converts it into a data frame.

library(jsonlite)

jsondata <- fromJSON("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.json")

kable(jsondata, format = "pipe", caption = "Books Data Frame (JSON Data)", align = "lllccl")
Books Data Frame (JSON Data)

| Title | Subtitle | Authors | Pages | Year | Publisher |
|:---|:---|:---|:---:|:---:|:---|
| Driven to Distraction | Recognizing and Coping with Attention Deficit Disorder from Childhood through Adulthood | Edward M. Hallowell, John J. Ratey | 319 | 1994 | Pantheon Books |
| This is Your Brain on Music | The Science of a Human Obsession | Daniel J. Levitin | 322 | 2006 | Penguin Group |
| blink | The Power of Thinking Without Thinking | Malcolm Gladwell | 296 | 2007 | Little, Brown and Company |
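As a rough sketch of why “fromJSON” can return a data frame directly, the example below parses a short JSON string laid out as an array of objects, one object per book. I am assuming that layout here; the string `json_text` and the names `books` and `records` are made up for illustration and are not copied from my actual file.

```r
library(jsonlite)

# Hypothetical JSON text: an array of objects, one object per book.
json_text <- '[
  {"Title": "blink", "Pages": 296, "Year": 2007},
  {"Title": "Driven to Distraction", "Pages": 319, "Year": 1994}
]'

# With the default simplification, an array of objects becomes a data frame.
books <- fromJSON(json_text)

# Turning simplification off returns a plain list of records instead.
records <- fromJSON(json_text, simplifyVector = FALSE)

str(books)
```

Because the numbers in the JSON text are unquoted, they come through as numeric columns without any extra conversion step.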

XML

In this code block, I import an XML file from my GitHub repository using the XML package. I initially had trouble reading the file directly from its web address. Someone had reported a similar problem on Stack Overflow, and the responses suggested that the XML package has issues with web addresses beginning with “https”, and that this can be resolved by downloading the file contents with RCurl first and then parsing the downloaded text.

library(XML)
library(RCurl)

xmlwebaddress <- "https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.xml"
# Download the file once as text, then parse that text (no second download).
RCurladdress <- getURL(xmlwebaddress)
xmldata <- xmlToDataFrame(xmlParse(RCurladdress))

#Again, this could be done as a single command (shown below).

#xmldata <- xmlToDataFrame(xmlParse(getURL("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.xml")))

kable(xmldata, format = "pipe", caption = "Books Data Frame (XML Data)", align = "lllccl")
Books Data Frame (XML Data)

| Title | Subtitle | Authors | Pages | Year | Publisher |
|:---|:---|:---|:---:|:---:|:---|
| Driven to Distraction | Recognizing and Coping with Attention Deficit Disorder from Childhood Through Adulthod | Edward M. Hallowell, John J. Ratey | 319 | 1994 | Pantheon Books |
| This is Your Brain on Music | The Science of a Human Obsession | Daniel J. Levitin | 322 | 2006 | Penguin Group |
| blink | The Power of Thinking Without Thinking | Malcolm Gladwell | 296 | 2007 | Little, Brown and Company |
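The parsing half of this workflow can be sketched without a network connection. In the example below, `xml_text` is a made-up string standing in for the text that getURL returns, and its layout (a root element with one child per book) is my assumption about a typical file, not a copy of my actual one.

```r
library(XML)

# Hypothetical XML text, standing in for the string that getURL() returns.
xml_text <- "<books>
  <book><Title>blink</Title><Pages>296</Pages><Year>2007</Year></book>
  <book><Title>Driven to Distraction</Title><Pages>319</Pages><Year>1994</Year></book>
</books>"

# asText = TRUE says the argument is XML content, not a file name or URL.
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes a row; each of its children becomes a column.
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

str(books)
```

Every value in an XML element is text, so all of the resulting columns, including Pages and Year, are character columns.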

In this code block, I compare the three data frames.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(htmldata)
## Rows: 3
## Columns: 6
## $ Title     <chr> "Driven to Distraction", "This is Your Brain on Music", "bli…
## $ Subtitle  <chr> "Recognizing and Coping with Attention Deficit Disorder from…
## $ Authors   <chr> "Edward M. Hallowell, John J. Ratey", "Daniel J. Levitin", "…
## $ Pages     <int> 319, 322, 296
## $ Year      <int> 1994, 2006, 2007
## $ Publisher <chr> "Pantheon Books", "Penguin Group", "Little, Brown and Compan…
glimpse(jsondata)
## Rows: 3
## Columns: 6
## $ Title     <chr> "Driven to Distraction", "This is Your Brain on Music", "bli…
## $ Subtitle  <chr> "Recognizing and Coping with Attention Deficit Disorder from…
## $ Authors   <chr> "Edward M. Hallowell, John J. Ratey", "Daniel J. Levitin", "…
## $ Pages     <int> 319, 322, 296
## $ Year      <int> 1994, 2006, 2007
## $ Publisher <chr> "Pantheon Books", "Penguin Group", "Little, Brown and Compan…
glimpse(xmldata)
## Rows: 3
## Columns: 6
## $ Title     <chr> "Driven to Distraction", "This is Your Brain on Music", "bli…
## $ Subtitle  <chr> "Recognizing and Coping with Attention Deficit Disorder from…
## $ Authors   <chr> "Edward M. Hallowell, John J. Ratey", "Daniel J. Levitin", "…
## $ Pages     <chr> "319", "322", "296"
## $ Year      <chr> "1994", "2006", "2007"
## $ Publisher <chr> "Pantheon Books", "Penguin Group", "Little, Brown and Compan…
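
Reading three glimpse outputs side by side works for small data frames, but the column types can also be compared in a single table. This is only a sketch: `html_like`, `json_like`, and `xml_like` are made-up two-column stand-ins for the data frames above, and `class_table` is a name I chose for illustration.

```r
# Hypothetical two-column stand-ins for the three data frames above.
html_like <- data.frame(Pages = c(319L, 322L, 296L),
                        Year  = c(1994L, 2006L, 2007L))
json_like <- html_like
xml_like  <- data.frame(Pages = c("319", "322", "296"),
                        Year  = c("1994", "2006", "2007"),
                        stringsAsFactors = FALSE)

# One row per data frame, one column per variable, each cell a column class.
class_table <- t(sapply(
  list(html = html_like, json = json_like, xml = xml_like),
  function(df) sapply(df, class)
))

print(class_table)
```

Any row that differs from the others points to a data frame whose columns would need converting before the three could be treated as equivalent.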

Findings and Recommendations

Although their tabular outputs (shown above) appear identical, there is one distinction among the three data frames: the XML data frame columns are all character columns, while the “Pages” and “Year” columns in the HTML and JSON data frames are integers. Going forward, if I wanted to import XML data and perform calculations on the resulting data frame, I might need additional lines of code to change the column types. This makes me wonder whether the difference affects how people choose to store data. For example, is information that contains a lot of numeric values usually stored in HTML or JSON formats because of this, or is that not a consideration? In general, I’m curious about what factors influence the selection of JSON, HTML, or XML file types to store data.
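
As a sketch of what those additional lines of code might look like, the example below converts character columns to integers in two ways. The data frame `xml_like` is a made-up stand-in for xmldata (with only three of its six columns), and `converted` and `guessed` are names I chose for illustration.

```r
# Hypothetical stand-in for xmldata: every column arrives as character.
xml_like <- data.frame(
  Title = c("Driven to Distraction", "This is Your Brain on Music", "blink"),
  Pages = c("319", "322", "296"),
  Year  = c("1994", "2006", "2007"),
  stringsAsFactors = FALSE
)

# Option 1: convert named columns explicitly.
converted <- xml_like
converted$Pages <- as.integer(converted$Pages)
converted$Year  <- as.integer(converted$Year)

# Option 2: let R guess each column's type; as.is = TRUE keeps text as character.
guessed <- type.convert(xml_like, as.is = TRUE)

# Calculations now work on the converted columns.
mean(converted$Pages)
```

The explicit version makes it obvious which columns change, while type.convert is shorter when there are many numeric columns, at the cost of relying on R's guess for each one.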