Introduction

This document reads three files describing the same three books, stored as JSON, HTML, and XML, into R data frames.

Libraries

library(tidyverse)
library(rvest)
library(jsonlite)
library(RCurl)
library(XML)

JSON

First, the JSON file. I read the file from its GitHub path and use fromJSON to parse it. fromJSON returns a list whose single item is a data frame, so for simplicity's sake I index into the list to grab just the data frame.

url <- 'https://raw.githubusercontent.com/cmm6/data607-assignment7/main/books.json'

# Read URL into data frame
books_json <- fromJSON(txt=url)
books_json <- books_json[[1]]
print(books_json)
##                title                     author year      genre pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
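
To illustrate why the indexing step is needed, here is a minimal sketch that parses an inline JSON string instead of the remote file. The top-level key name "books" and the exact layout of the string are assumptions for illustration; the real books.json may be organized differently.

```r
library(jsonlite)

# Hypothetical inline JSON: one top-level key holding an array of records.
# The key name "books" is an assumption, not taken from the real file.
json_txt <- '{"books": [
  {"title": "Luster", "author": "Raven Leilani", "year": 2020, "genre": "fiction", "pages": 240},
  {"title": "Disappearing Earth", "author": "Julia Phillips", "year": 2019, "genre": "fiction", "pages": 312}
]}'

parsed <- fromJSON(json_txt)

# The parsed object is a list with one element per top-level key
str(parsed, max.level = 1)

# Indexing by position or by name pulls out the data frame itself
books <- parsed[[1]]
class(books)
sapply(books, class)
```

Indexing by name (parsed[["books"]]) would work as well and is more robust if the file later gains additional top-level keys.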

XML

Next, the XML file. I download it with getURL, parse it with xmlParse, and use xmlToDataFrame to read it into a data frame.

xml_file <- getURL('https://raw.githubusercontent.com/cmm6/data607-assignment7/main/books.xml')

# Read URL into data frame
books_xml <- xmlToDataFrame(xmlParse(xml_file))
print(books_xml)
##                 title                     author year      genre pages
## 1              Luster              Raven Leilani 2020    fiction   240
## 2      Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3  Disappearing Earth             Julia Phillips 2019    fiction   312
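
The same two-step pattern can be tried on an inline XML string. This is only a sketch: the element names (books, book, and the child tags) are assumed for illustration and may not match the real books.xml. It also shows that xmlToDataFrame reads every column as text, so numeric columns need an explicit conversion.

```r
library(XML)

# Hypothetical XML with one <book> node per record (tag names are assumptions)
xml_txt <- '<books>
  <book><title>Luster</title><author>Raven Leilani</author><year>2020</year><genre>fiction</genre><pages>240</pages></book>
  <book><title>Disappearing Earth</title><author>Julia Phillips</author><year>2019</year><genre>fiction</genre><pages>312</pages></book>
</books>'

# asText = TRUE tells the parser the input is XML content, not a file path
doc <- xmlParse(xml_txt, asText = TRUE)
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column comes back as character
sapply(books, class)

# Convert the numeric columns explicitly
books$year <- as.integer(books$year)
books$pages <- as.integer(books$pages)
sapply(books, class)
```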

HTML

Finally, the HTML file. Again, I read the file from its GitHub path, then use read_html, html_nodes, and html_table to extract the table. html_table also returns a list with a data frame as its single item, so again I index into the list to grab just the data frame.

html_file <- getURL('https://raw.githubusercontent.com/cmm6/data607-assignment7/main/books.html')

# https://www.r-bloggers.com/2015/01/using-rvest-to-scrape-an-html-table/
# Read URL into a data frame
books_html <- html_file %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table()
books_html <- books_html[[1]]
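
As a small sketch of what the pipeline returns, the same steps can be run on an inline HTML string. The markup below is a made-up stand-in for books.html, not a copy of it.

```r
library(rvest)

# Hypothetical page containing a single table (markup is an assumption)
html_txt <- '<html><body><table>
  <tr><th>Title</th><th>Author</th><th>Year</th><th>Genre</th><th>Pages</th></tr>
  <tr><td>Luster</td><td>Raven Leilani</td><td>2020</td><td>fiction</td><td>240</td></tr>
  <tr><td>Disappearing Earth</td><td>Julia Phillips</td><td>2019</td><td>fiction</td><td>312</td></tr>
</table></body></html>'

tables <- html_txt %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()

# One list element per <table> node found on the page
length(tables)

# Index into the list to get the single table
books <- tables[[1]]
names(books)
```

If a page held several tables, the list would have one element for each, so the index would need to match the table of interest.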

Conclusions

These imported data frames are mostly similar. The column names differ in case because I titled them differently in the source files, e.g. Title in HTML vs. title in JSON and XML. The JSON data frame also has a list number (1, 2, and 3) added at the front. The data types differ as well: year and pages are integers in both JSON and HTML, but characters in XML.

print(books_html)
##                Title                     Author Year      Genre Pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
print(books_xml)
##                 title                     author year      genre pages
## 1              Luster              Raven Leilani 2020    fiction   240
## 2      Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3  Disappearing Earth             Julia Phillips 2019    fiction   312
print(books_json)
##                title                     author year      genre pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
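
The differences noted above (column-name case and column types) could be reconciled with a small helper. This is a sketch under stated assumptions: harmonize is a hypothetical function name, and the two one-row data frames are stand-ins for the real imports rather than the actual data.

```r
# Stand-ins for two of the imports: one with capitalized names and integer
# columns, one with lowercase names and character columns
books_a <- data.frame(Title = "Luster", Year = 2020L, Pages = 240L,
                      stringsAsFactors = FALSE)
books_b <- data.frame(title = "Luster", year = "2020", pages = "240",
                      stringsAsFactors = FALSE)

# Hypothetical helper: lowercase the column names and let type.convert
# pick a type for each column from its text representation
harmonize <- function(df) {
  names(df) <- tolower(names(df))
  df[] <- lapply(df, function(col) type.convert(as.character(col), as.is = TRUE))
  df
}

a <- harmonize(books_a)
b <- harmonize(books_b)

# Compare column classes, then the data frames themselves
sapply(a, class)
sapply(b, class)
all.equal(a, b)
```

Applying the same helper to books_json, books_xml, and books_html would be one way to make the three data frames directly comparable.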