Assignment 7

Introduction

The purpose of this assignment is to familiarize ourselves with different formats of stored data. I have created three identical tables in three different file types (XML, HTML, and JSON). We will now load these files into our R environment.

XML

Here we load in our XML file and take a look at the structure.

library(xml2)
library(XML)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

books_XML <- read_xml(url("https://raw.githubusercontent.com/bwolin99/TestRepo/refs/heads/main/Assignment%207/Books.XML"))
xml_structure(books_XML)

## <table>
##   <book>
##     <Title>
##       {text}
##     <Author1>
##       {text}
##     <Author2>
##       {text}
##     <Genre>
##       {text}
##   <book>
##     <Title>
##       {text}
##     <Author1>
##       {text}
##     <Author2>
##     <Genre>
##       {text}
##   <book>
##     <Title>
##       {text}
##     <Author1>
##       {text}
##     <Author2>
##     <Genre>
##       {text}

Next we extract the columns and put them into an R dataframe.

Names <- books_XML %>%
  xml_find_all("//Title") %>%
  xml_text()
Author1 <- books_XML %>%
  xml_find_all("//Author1") %>%
  xml_text()
Author2 <- books_XML %>%
  xml_find_all("//Author2") %>%
  xml_text()
Genre <- books_XML %>%
  xml_find_all("//Genre") %>%
  xml_text()

books_xml_final <- data.frame("Title" = Names, "Author 1" = Author1, "Author 2" = Author2, "Genre" = Genre) 

books_xml_final

##                                  Title        Author.1        Author.2
## 1                            Stardance Spider Robinson Jeanne Robinson
## 2                             Hyperion     Dan Simmons                
## 3 Do Androids Dream of Electric Sheep?  Philip K. Dick                
##              Genre
## 1           Sci Fi
## 2 Sci Fi, Thriller
## 3 Sci Fi, Thriller
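The column-by-column extraction above works because every book carries all four elements, even when Author2 is empty. If a book omitted an element entirely, the extracted vectors would differ in length and data.frame() would generally fail. The following is a minimal sketch of a per-book alternative, assuming a small inline sample (sample_xml, a hypothetical stand-in for Books.XML) and a hypothetical helper named field; it is one possible approach, not the method used in this assignment.

```r
library(xml2)

# Hypothetical inline sample: the second book has no <Author2> element at all.
sample_xml <- "<table>
  <book><Title>Stardance</Title><Author1>Spider Robinson</Author1><Author2>Jeanne Robinson</Author2><Genre>Sci Fi</Genre></book>
  <book><Title>Hyperion</Title><Author1>Dan Simmons</Author1><Genre>Sci Fi, Thriller</Genre></book>
</table>"

doc <- read_xml(sample_xml)

# One node per book, so every extracted column has one entry per book.
books <- xml_find_all(doc, "//book")

# Hypothetical helper: look up a child element relative to each book.
# xml_find_first() returns a missing node where the child is absent,
# and xml_text() turns that missing node into NA.
field <- function(name) {
  xml_text(xml_find_first(books, paste0("./", name)))
}

books_by_node <- data.frame(
  Title   = field("Title"),
  Author1 = field("Author1"),
  Author2 = field("Author2"),
  Genre   = field("Genre"),
  stringsAsFactors = FALSE
)

books_by_node
```

Because each lookup is anchored to a book node, a missing element shows up as NA in that row instead of shifting the values of later rows.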

HTML

Here we load in our HTML table and tidy it into a neat dataframe.

library(rvest)
books_html <- read_html(url("https://raw.githubusercontent.com/bwolin99/TestRepo/refs/heads/main/Assignment%207/Books.html"))

books_html_final <- books_html %>%
  html_element("body") %>%
  html_table()

books_html_final

## # A tibble: 3 × 4
##   Title                                `Author 1`      `Author 2`        Genre  
##   <chr>                                <chr>           <chr>             <chr>  
## 1 Stardance                            Spider Robinson "Jeanne Robinson" Sci Fi 
## 2 Hyperion                             Dan Simmons     ""                Sci Fi…
## 3 Do Androids Dream of Electric Sheep? Philip K. Dick  ""                Sci Fi…

JSON

With the jsonlite library, we don't even need to tidy our JSON file. The fromJSON function returns a data frame directly.

library(jsonlite)
books_json <- fromJSON(url("https://raw.githubusercontent.com/bwolin99/TestRepo/refs/heads/main/Assignment%207/Books.json"))

books_json

##                                  Title         Author1         Author2
## 1                            Stardance Spider Robinson Jeanne Robinson
## 2                             Hyperion     Dan Simmons                
## 3 Do Androids Dream of Electric Sheep?  Philip K. Dick                
##              Genre
## 1           Sci Fi
## 2 Sci Fi, Thriller
## 3 Sci Fi, Thriller
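To illustrate why no tidying was needed, here is a small sketch using an inline JSON string (sample_json, a hypothetical stand-in that assumes Books.json is an array of one object per book). By default fromJSON simplifies such an array into a data frame; passing simplifyVector = FALSE keeps the nested list structure instead.

```r
library(jsonlite)

# Hypothetical inline sample: an array of records, one object per book.
sample_json <- '[
  {"Title": "Stardance", "Author1": "Spider Robinson", "Author2": "Jeanne Robinson", "Genre": "Sci Fi"},
  {"Title": "Hyperion", "Author1": "Dan Simmons", "Author2": "", "Genre": "Sci Fi, Thriller"}
]'

# Default behaviour: the array of records is simplified into a data frame.
books_df <- fromJSON(sample_json)

# Without simplification, the result is a plain list of two named lists.
books_list <- fromJSON(sample_json, simplifyVector = FALSE)

str(books_df)
str(books_list, max.level = 1)
```

The simplified form is convenient for flat tables like this one; the unsimplified list can be easier to work with when records are nested or have differing fields.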

Assignment 7

Ben Wolin

2024-10-13
