I have placed three files in the the three formats (XML, JSON, HTML) in a AWS S3 folder so they can be accesed directly from the internet. Links are below:
https://cuny-msds.s3.amazonaws.com/books.xml
https://cuny-msds.s3.amazonaws.com/books.json
https://cuny-msds.s3.amazonaws.com/books.html
We will ready the files, parse them and ensure they are within an R dataframe or tibble.
Let’s load the libraries we will use.
rm(list=ls())
library(tidyverse)
library(XML)
library(xml2)
Let’s load the files directly from our AWS S3 folders.
url <- "https://cuny-msds.s3.amazonaws.com/books.xml"
data_xml <- read_xml(url)
books_xml <- xmlParse(data_xml)
df_books_xml <- xmlToDataFrame(nodes = getNodeSet(books_xml, "//book"))
Let’s check the structure and look at the data
str(df_books_xml)
## 'data.frame': 4 obs. of 6 variables:
## $ title : chr "1984" "Dune" "Farenheit 451" "The Talisman"
## $ author1 : chr "George Orwell" "Frank Herbert" "Ray Bradbury" "Stephen King"
## $ author2 : chr "" "" "" "Peter Straub"
## $ year : chr "1949" "1965" "1965" "2012"
## $ isbn : chr "97804515224935" "0441013597" "9780006546061" "145169721X"
## $ amazon_price: chr "7.48" "14.79" "12.01" "22.86"
Some numerical values were imported as character. So let’s convert them to numerical
df_books_xml$year <- as.numeric(as.character(df_books_xml$year))
df_books_xml$amazon_price <- as.numeric(as.character(df_books_xml$amazon_price))
str(df_books_xml)
## 'data.frame': 4 obs. of 6 variables:
## $ title : chr "1984" "Dune" "Farenheit 451" "The Talisman"
## $ author1 : chr "George Orwell" "Frank Herbert" "Ray Bradbury" "Stephen King"
## $ author2 : chr "" "" "" "Peter Straub"
## $ year : num 1949 1965 1965 2012
## $ isbn : chr "97804515224935" "0441013597" "9780006546061" "145169721X"
## $ amazon_price: num 7.48 14.79 12.01 22.86
df_books_xml
## title author1 author2 year isbn amazon_price
## 1 1984 George Orwell 1949 97804515224935 7.48
## 2 Dune Frank Herbert 1965 0441013597 14.79
## 3 Farenheit 451 Ray Bradbury 1965 9780006546061 12.01
## 4 The Talisman Stephen King Peter Straub 2012 145169721X 22.86
Now it looks all good. Let’s move on!
Let’s load libraries
library("jsonlite")
library("rjson")
Let’s load the data into R.
url <- "https://cuny-msds.s3.amazonaws.com/books.json"
data_json <- jsonlite::fromJSON(url)
df_json <- as.data.frame(data_json)
Let’s check the structure
str(df_json)
## 'data.frame': 4 obs. of 10 variables:
## $ pages : int 328 704 192 944
## $ lang : chr "eng" "eng" "eng" "eng"
## $ title : chr "1984" "Dune" "Farenheit 451" "The Talisman"
## $ author1_first_name: chr "George" "Frank" "Ray" "Stephen"
## $ author1_last_name : chr "Orwell" "Herbert" "Bradbury" "King"
## $ author2_first_name: chr "" "" "" "Peter"
## $ author2_last_name : chr "" "" "" "Straub"
## $ year : int 1949 1965 1965 2012
## $ isbn : chr "97804515224935" "0441013597" "9780006546061" "145169721X"
## $ amazon_price : num 7.48 14.79 12.01 22.86
df_json
## pages lang title author1_first_name author1_last_name
## 1 328 eng 1984 George Orwell
## 2 704 eng Dune Frank Herbert
## 3 192 eng Farenheit 451 Ray Bradbury
## 4 944 eng The Talisman Stephen King
## author2_first_name author2_last_name year isbn amazon_price
## 1 1949 97804515224935 7.48
## 2 1965 0441013597 14.79
## 3 1965 9780006546061 12.01
## 4 Peter Straub 2012 145169721X 22.86
All looks good! let’s continue.
We will use the rvest library to load HTML into R.
library(rvest)
Load the data from the S3 folder in our AWS account
url <- "https://cuny-msds.s3.amazonaws.com/books.html"
data_html <- read_html(url)
table_html <- html_table(data_html, fill=TRUE)[[1]]
Let’s check the structure and look at the table
str(table_html)
## tibble [4 x 10] (S3: tbl_df/tbl/data.frame)
## $ pages : int [1:4] 328 704 192 944
## $ lang : chr [1:4] "eng" "eng" "eng" "eng"
## $ title : chr [1:4] "1984" "Dune" "Farenheit 451" "The Talisman"
## $ author1_first: chr [1:4] "George" "Frank" "Ray" "Stephen"
## $ author1_last : chr [1:4] "Orwell" "Herbert" "Bradbury" "King"
## $ author2_first: chr [1:4] "" "" "" "Peter"
## $ author2_last : chr [1:4] "" "" "" "Straub"
## $ year : int [1:4] 1949 1965 1965 2012
## $ isbn : chr [1:4] "97804515224935" "0441013597" "9780006546061" "145169721X"
## $ amazon_price : num [1:4] 7.48 14.79 12.01 22.86
table_html
## # A tibble: 4 x 10
## pages lang title author1_first author1_last author2_first author2_last year
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 328 eng 1984 George Orwell "" "" 1949
## 2 704 eng Dune Frank Herbert "" "" 1965
## 3 192 eng Faren~ Ray Bradbury "" "" 1965
## 4 944 eng The T~ Stephen King "Peter" "Straub" 2012
## # ... with 2 more variables: isbn <chr>, amazon_price <dbl>
All looks good, we are done!