This document reads three files that describe the same three books in JSON, HTML, and XML, and loads each one into an R data frame.
# Load the packages used below
library(tidyverse)
library(rvest)
library(jsonlite)
library(RCurl)
library(XML)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
##
## Attaching package: 'XML'
## The following object is masked from 'package:rvest':
##
## xml
First, we take the JSON file. I read the file from its raw GitHub URL and parse it with fromJSON. The result is a list whose single element is the data frame, so for simplicity's sake we can index into the list to grab just the data frame.
url <- 'https://raw.githubusercontent.com/cmm6/data607-assignment7/main/books.json'
# Read URL into data frame
books_json <- fromJSON(txt=url)
books_json <- books_json[[1]]
print(books_json)
##                title                     author year      genre pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
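To see why fromJSON hands back a list rather than a bare data frame, here is a minimal sketch on an inline JSON string. I am assuming the file wraps its records in a single top-level key; the key name "books" below is hypothetical and not taken from the actual file, and the records are trimmed to two fields.

```r
library(jsonlite)

# Hypothetical structure: one top-level key ("books") holding an array of records
json_text <- '{"books": [
  {"title": "Luster", "year": 2020},
  {"title": "Disappearing Earth", "year": 2019}
]}'

# fromJSON accepts a JSON string as well as a URL or file path
parsed <- fromJSON(json_text)

# The top-level object becomes a list; the array of records inside it
# is simplified to a data frame
names(parsed)
class(parsed[[1]])

# Index into the list to get the data frame itself
books <- parsed[[1]]
str(books)
```

If the JSON were a bare array of records with no wrapping key, fromJSON would return the data frame directly and the indexing step would not be needed.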
Then, we take the XML file. I download it as text with getURL, parse it with xmlParse, and convert the parsed document into a data frame with xmlToDataFrame.
xml_file <- getURL('https://raw.githubusercontent.com/cmm6/data607-assignment7/main/books.xml')
# Read URL into data frame
books_xml <- xmlToDataFrame(xmlParse(xml_file))
print(books_xml)
##                title                     author year      genre pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
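The same two steps can be tried on an inline XML string, which makes it easier to see what xmlToDataFrame does with numeric-looking values. This is only a sketch: the element names "books" and "book" are my assumption about the layout, not copied from the actual file.

```r
library(XML)

# Hypothetical layout: one <book> element per record under a <books> root
xml_text <- '<books>
  <book><title>Luster</title><year>2020</year></book>
  <book><title>Disappearing Earth</title><year>2019</year></book>
</books>'

# asText = TRUE tells xmlParse the input is XML content, not a file name
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes a row, each grandchild a column
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column is read as text, including year
str(books)
```

Because XML carries no type information for element text, the year column comes back as character and has to be converted explicitly if numbers are needed.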
Then, we take the HTML file. Again, I read the file from its raw GitHub URL, parse it with read_html, select the table with html_nodes, and convert it with html_table. html_table returns a list with a data frame as its single item, so again we can index into the list to grab just the data frame.
html_file <- getURL('https://raw.githubusercontent.com/cmm6/data607-assignment7/main/books.html')
# https://www.r-bloggers.com/2015/01/using-rvest-to-scrape-an-html-table/
# Read URL into a data frame
books_html <- html_file %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table()
books_html <- books_html[[1]]
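The pipeline can also be run on an inline HTML string to show where the list comes from. This is a minimal sketch with a made-up two-column table, not the actual file contents.

```r
library(rvest)

# A small stand-in table; the real file has more columns
html_text <- '<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Luster</td><td>2020</td></tr>
  <tr><td>Disappearing Earth</td><td>2019</td></tr>
</table>'

# html_nodes() returns every matching <table>, so html_table()
# returns a list with one data frame per table found
tables <- html_text %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table()

length(tables)

# With a single table on the page, the first element is the one we want
books <- tables[[1]]
str(books)
```

The header cells become the column names as written, which is why the capitalization in the HTML file carries over to the data frame.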
The three imported data frames are mostly the same. The column names differ in case because I capitalized them differently in the source files, e.g. Title in the HTML file vs. title in the JSON and XML files. The JSON data frame also prints with the row numbers 1, 2, and 3 in front of each row. The data types differ too: year and pages are integers in the JSON and HTML data frames, but characters in the XML data frame.
print(books_html)
##                Title                     Author Year      Genre Pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
print(books_xml)
##                title                     author year      genre pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
print(books_json)
##                title                     author year      genre pages
## 1             Luster              Raven Leilani 2020    fiction   240
## 2     Big Friendship Aminatou Sow, Ann Friedman 2020 nonfiction   256
## 3 Disappearing Earth             Julia Phillips 2019    fiction   312
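If the goal were to make the three data frames match exactly, the case and type differences could be ironed out with a couple of base R steps. The sketch below is one possible approach, not something done in the original analysis. It uses small stand-in data frames (the names xml_like and html_like are hypothetical) so that it runs without downloading anything.

```r
# Stand-in for the XML result: lower-case names, all columns character
xml_like <- data.frame(
  title = c("Luster", "Disappearing Earth"),
  year  = c("2020", "2019"),
  pages = c("240", "312"),
  stringsAsFactors = FALSE
)

# Stand-in for the HTML result: capitalized names, integer columns
html_like <- data.frame(
  Title = c("Luster", "Disappearing Earth"),
  Year  = c(2020L, 2019L),
  Pages = c(240L, 312L),
  stringsAsFactors = FALSE
)

# Step 1: put the column names in a common case
names(html_like) <- tolower(names(html_like))

# Step 2: let R guess a type for each text column;
# as.is = TRUE keeps non-numeric text as character instead of factor
xml_like[] <- lapply(xml_like, type.convert, as.is = TRUE)

# Compare the two harmonized data frames
all.equal(xml_like, html_like)
```

The same two steps would apply to books_html and books_xml above, with books_json as a third frame to compare against.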