Week 7 Assignment

Overview

For this assignment, I created an HTML file, a JSON file, and an XML file, each containing the same information on three of my favorite books, and stored them in my GitHub repository. The task is to load each file into its own R data frame, giving three data frames in total.

HTML

In this code block, I read the HTML file from my GitHub repository with read_html, extract its table with rvest's “html_table” function, and convert the result into a data frame with as.data.frame.

library(rvest)
library(knitr)

htmldata <- read_html("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.html")
htmldata <- html_table(htmldata)
htmldata <- as.data.frame(htmldata)

#These three commands can be done as a single command (shown below).

#htmldata <- as.data.frame(html_table(read_html("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.html")))

kable(htmldata, format = "pipe", caption = "Books Data Frame (HTML Data)", align = "lllccl")

Books Data Frame (HTML Data)
Title	Subtitle	Authors	Pages	Year	Publisher
Driven to Distraction	Recognizing and Coping with Attention Deficit Disorder from Childhood through Adulthood	Edward M. Hallowell, John J. Ratey	319	1994	Pantheon Books
This is Your Brain on Music	The Science of a Human Obsession	Daniel J. Levitin	322	2006	Penguin Group
blink	The Power of Thinking Without Thinking	Malcolm Gladwell	296	2007	Little, Brown and Company
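As a side note on why the as.data.frame step is needed: html_table applied to a whole document returns a list with one element per table, not a single table. The following is a minimal sketch using a small inline HTML string as a hypothetical stand-in for marleybooks.html (the names sample_html, all_tables, and books are illustrative only), so it runs without network access.

```r
library(rvest)

# Hypothetical inline HTML standing in for the real file (assumption: one table)
sample_html <- '
<table>
  <tr><th>Title</th><th>Pages</th><th>Year</th></tr>
  <tr><td>blink</td><td>296</td><td>2007</td></tr>
  <tr><td>This is Your Brain on Music</td><td>322</td><td>2006</td></tr>
</table>'

page <- read_html(sample_html)

# On a whole document, html_table() returns a list with one element per <table>
all_tables <- html_table(page)
length(all_tables)

# Selecting the table node first yields a single table, which converts cleanly
books <- as.data.frame(html_table(html_element(page, "table")))
sapply(books, class)
```

Selecting the node explicitly with html_element would matter more for a page containing several tables, where converting the whole list at once would combine them.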

JSON

In this code block, I import a JSON file from my GitHub repository using the jsonlite package, whose “fromJSON” function reads the file and converts it into a data frame.

library(jsonlite)

jsondata <- fromJSON("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.json")

kable(jsondata, format = "pipe", caption = "Books Data Frame (JSON Data)", align = "lllccl")

Books Data Frame (JSON Data)
Title	Subtitle	Authors	Pages	Year	Publisher
Driven to Distraction	Recognizing and Coping with Attention Deficit Disorder from Childhood through Adulthood	Edward M. Hallowell, John J. Ratey	319	1994	Pantheon Books
This is Your Brain on Music	The Science of a Human Obsession	Daniel J. Levitin	322	2006	Penguin Group
blink	The Power of Thinking Without Thinking	Malcolm Gladwell	296	2007	Little, Brown and Company
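To illustrate how fromJSON arrives at a data frame, here is a minimal sketch with an inline JSON string. It assumes the file is laid out as an array of records with identical keys, which may not match the actual structure of marleybooks.json; sample_json, books, and raw_records are hypothetical names.

```r
library(jsonlite)

# Hypothetical JSON: an array of records with the same keys (an assumption)
sample_json <- '[
  {"Title": "blink", "Pages": 296, "Year": 2007},
  {"Title": "This is Your Brain on Music", "Pages": 322, "Year": 2006}
]'

# By default, an array of records is simplified into a data frame
books <- fromJSON(sample_json)
sapply(books, class)

# Turning that simplification off returns a plain list of records instead
raw_records <- fromJSON(sample_json, simplifyDataFrame = FALSE)
class(raw_records)
```

Because the numbers are unquoted in the JSON, they arrive as numeric values without any further conversion.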

XML

In this code block, I import an XML file from my GitHub repository using the XML package. At first I could not read the file directly from its web address. A Stack Overflow thread about a similar problem suggested that the XML package has trouble with web addresses beginning with “https”, and that the problem can be resolved by downloading the file contents with RCurl first and then parsing them.

library(XML)
library(RCurl)

xmlwebaddress <- "https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.xml"
#Download the file contents as text with RCurl, then parse that text.
RCurladdress <- getURL(xmlwebaddress)
xmldata <- xmlToDataFrame(xmlParse(RCurladdress))

#Again, this could be done as a single command (shown below).

#xmldata <- xmlToDataFrame(xmlParse(getURL("https://raw.githubusercontent.com/Marley-Myrianthopoulos/data607/main/marleybooks.xml")))

kable(xmldata, format = "pipe", caption = "Books Data Frame (XML Data)", align = "lllccl")

Books Data Frame (XML Data)
Title	Subtitle	Authors	Pages	Year	Publisher
Driven to Distraction	Recognizing and Coping with Attention Deficit Disorder from Childhood Through Adulthod	Edward M. Hallowell, John J. Ratey	319	1994	Pantheon Books
This is Your Brain on Music	The Science of a Human Obsession	Daniel J. Levitin	322	2006	Penguin Group
blink	The Power of Thinking Without Thinking	Malcolm Gladwell	296	2007	Little, Brown and Company
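The parsing step can also be tried without any download. The sketch below feeds a small inline XML string to xmlParse with asText = TRUE. The element names (books, book) are assumptions and may differ from those in marleybooks.xml.

```r
library(XML)

# Hypothetical XML with one child element per column (structure is an assumption)
sample_xml <- '<books>
  <book><Title>blink</Title><Pages>296</Pages><Year>2007</Year></book>
  <book><Title>This is Your Brain on Music</Title><Pages>322</Pages><Year>2006</Year></book>
</books>'

# asText = TRUE tells xmlParse the argument is XML content, not a file name
doc <- xmlParse(sample_xml, asText = TRUE)

# Each <book> becomes a row; element text is read in as character values
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
sapply(books, class)
```

Element text carries no type information, so every column comes back as character unless column classes are supplied or converted afterwards.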

In this code block, I compare the structure of the three data frames using dplyr's glimpse function, which lists each column with its type.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(htmldata)

## Rows: 3
## Columns: 6
## $ Title     <chr> "Driven to Distraction", "This is Your Brain on Music", "bli…
## $ Subtitle  <chr> "Recognizing and Coping with Attention Deficit Disorder from…
## $ Authors   <chr> "Edward M. Hallowell, John J. Ratey", "Daniel J. Levitin", "…
## $ Pages     <int> 319, 322, 296
## $ Year      <int> 1994, 2006, 2007
## $ Publisher <chr> "Pantheon Books", "Penguin Group", "Little, Brown and Compan…

glimpse(jsondata)

## Rows: 3
## Columns: 6
## $ Title     <chr> "Driven to Distraction", "This is Your Brain on Music", "bli…
## $ Subtitle  <chr> "Recognizing and Coping with Attention Deficit Disorder from…
## $ Authors   <chr> "Edward M. Hallowell, John J. Ratey", "Daniel J. Levitin", "…
## $ Pages     <int> 319, 322, 296
## $ Year      <int> 1994, 2006, 2007
## $ Publisher <chr> "Pantheon Books", "Penguin Group", "Little, Brown and Compan…

glimpse(xmldata)

## Rows: 3
## Columns: 6
## $ Title     <chr> "Driven to Distraction", "This is Your Brain on Music", "bli…
## $ Subtitle  <chr> "Recognizing and Coping with Attention Deficit Disorder from…
## $ Authors   <chr> "Edward M. Hallowell, John J. Ratey", "Daniel J. Levitin", "…
## $ Pages     <chr> "319", "322", "296"
## $ Year      <chr> "1994", "2006", "2007"
## $ Publisher <chr> "Pantheon Books", "Penguin Group", "Little, Brown and Compan…
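The same comparison can be made programmatically instead of by reading the glimpse output. This is a minimal sketch using two small hypothetical data frames (int_version and chr_version) that stand in for the HTML/JSON and XML results respectively.

```r
# Hypothetical stand-ins for the imported data frames
int_version <- data.frame(Title = "blink", Pages = 296L, Year = 2007L,
                          stringsAsFactors = FALSE)
chr_version <- data.frame(Title = "blink", Pages = "296", Year = "2007",
                          stringsAsFactors = FALSE)

# One row per column name, one column per data frame, holding the column class
class_table <- data.frame(
  int_version = sapply(int_version, class),
  chr_version = sapply(chr_version, class)
)
class_table

# Names of the columns whose classes differ between the two data frames
mismatched <- rownames(class_table)[class_table$int_version != class_table$chr_version]
mismatched
```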

Findings and Recommendations

Although their tabular outputs (shown above) appear essentially identical, apart from a small spelling difference in one subtitle of the XML output, there is one structural distinction between the three data frames. The XML data frame columns are all character columns, while the “Pages” and “Year” columns in the HTML and JSON data frames are integers. Going forward, if I import XML data and want to perform calculations on the resulting data frame, I may need additional lines of code to change the column types. This makes me wonder whether it affects how people choose to store data. For example, is information that contains a lot of numeric values usually stored in HTML or JSON formats for this reason, or is that not a consideration? In general, I’m curious about what factors influence the selection of JSON, HTML, or XML file types to store data.
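The additional lines of code mentioned above can be quite short. Here is a minimal sketch using base R's type.convert on a hypothetical data frame (xml_like) whose numeric columns were read in as text, as happens with the XML import.

```r
# Hypothetical data frame with numbers stored as text
xml_like <- data.frame(
  Title = c("blink", "Driven to Distraction"),
  Pages = c("296", "319"),
  Year  = c("2007", "1994"),
  stringsAsFactors = FALSE
)

# type.convert() guesses a type for each column; as.is = TRUE keeps text as character
converted <- utils::type.convert(xml_like, as.is = TRUE)
sapply(converted, class)

# Calculations now work on the converted columns
mean(converted$Pages)
```

An explicit alternative is to convert one column at a time, for example with as.integer, which makes the intended type visible in the code.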

Marley Myrianthopoulos

2023-10-21
