Introduction

I have placed three files in the the three formats (XML, JSON, HTML) in a AWS S3 folder so they can be accesed directly from the internet. Links are below:

https://cuny-msds.s3.amazonaws.com/books.xml

https://cuny-msds.s3.amazonaws.com/books.json

https://cuny-msds.s3.amazonaws.com/books.html

We will ready the files, parse them and ensure they are within an R dataframe or tibble.

XML to R

Let’s load the libraries we will use.

rm(list=ls())
library(tidyverse)
library(XML)
library(xml2)

Let’s load the files directly from our AWS S3 folders.

url <- "https://cuny-msds.s3.amazonaws.com/books.xml"
data_xml <- read_xml(url)
books_xml <- xmlParse(data_xml)
df_books_xml <- xmlToDataFrame(nodes = getNodeSet(books_xml, "//book"))

Let’s check the structure and look at the data

str(df_books_xml)
## 'data.frame':    4 obs. of  6 variables:
##  $ title       : chr  "1984" "Dune" "Farenheit 451" "The Talisman"
##  $ author1     : chr  "George Orwell" "Frank Herbert" "Ray Bradbury" "Stephen King"
##  $ author2     : chr  "" "" "" "Peter Straub"
##  $ year        : chr  "1949" "1965" "1965" "2012"
##  $ isbn        : chr  "97804515224935" "0441013597" "9780006546061" "145169721X"
##  $ amazon_price: chr  "7.48" "14.79" "12.01" "22.86"

Some numerical values were imported as character. So let’s convert them to numerical

df_books_xml$year <- as.numeric(as.character(df_books_xml$year))
df_books_xml$amazon_price <- as.numeric(as.character(df_books_xml$amazon_price))
str(df_books_xml)
## 'data.frame':    4 obs. of  6 variables:
##  $ title       : chr  "1984" "Dune" "Farenheit 451" "The Talisman"
##  $ author1     : chr  "George Orwell" "Frank Herbert" "Ray Bradbury" "Stephen King"
##  $ author2     : chr  "" "" "" "Peter Straub"
##  $ year        : num  1949 1965 1965 2012
##  $ isbn        : chr  "97804515224935" "0441013597" "9780006546061" "145169721X"
##  $ amazon_price: num  7.48 14.79 12.01 22.86
df_books_xml
##           title       author1      author2 year           isbn amazon_price
## 1          1984 George Orwell              1949 97804515224935         7.48
## 2          Dune Frank Herbert              1965     0441013597        14.79
## 3 Farenheit 451  Ray Bradbury              1965  9780006546061        12.01
## 4  The Talisman  Stephen King Peter Straub 2012     145169721X        22.86

Now it looks all good. Let’s move on!

From json to R

Let’s load libraries

library("jsonlite")
library("rjson")

Let’s load the data into R.

url <- "https://cuny-msds.s3.amazonaws.com/books.json" 
data_json <- jsonlite::fromJSON(url)
df_json <- as.data.frame(data_json)

Let’s check the structure

str(df_json)
## 'data.frame':    4 obs. of  10 variables:
##  $ pages             : int  328 704 192 944
##  $ lang              : chr  "eng" "eng" "eng" "eng"
##  $ title             : chr  "1984" "Dune" "Farenheit 451" "The Talisman"
##  $ author1_first_name: chr  "George" "Frank" "Ray" "Stephen"
##  $ author1_last_name : chr  "Orwell" "Herbert" "Bradbury" "King"
##  $ author2_first_name: chr  "" "" "" "Peter"
##  $ author2_last_name : chr  "" "" "" "Straub"
##  $ year              : int  1949 1965 1965 2012
##  $ isbn              : chr  "97804515224935" "0441013597" "9780006546061" "145169721X"
##  $ amazon_price      : num  7.48 14.79 12.01 22.86
df_json
##   pages lang         title author1_first_name author1_last_name
## 1   328  eng          1984             George            Orwell
## 2   704  eng          Dune              Frank           Herbert
## 3   192  eng Farenheit 451                Ray          Bradbury
## 4   944  eng  The Talisman            Stephen              King
##   author2_first_name author2_last_name year           isbn amazon_price
## 1                                      1949 97804515224935         7.48
## 2                                      1965     0441013597        14.79
## 3                                      1965  9780006546061        12.01
## 4              Peter            Straub 2012     145169721X        22.86

All looks good! let’s continue.

HTML to R

We will use the rvest library to load HTML into R.

library(rvest)

Load the data from the S3 folder in our AWS account

url <- "https://cuny-msds.s3.amazonaws.com/books.html"
data_html <- read_html(url)
table_html <- html_table(data_html, fill=TRUE)[[1]]

Let’s check the structure and look at the table

str(table_html)
## tibble [4 x 10] (S3: tbl_df/tbl/data.frame)
##  $ pages        : int [1:4] 328 704 192 944
##  $ lang         : chr [1:4] "eng" "eng" "eng" "eng"
##  $ title        : chr [1:4] "1984" "Dune" "Farenheit 451" "The Talisman"
##  $ author1_first: chr [1:4] "George" "Frank" "Ray" "Stephen"
##  $ author1_last : chr [1:4] "Orwell" "Herbert" "Bradbury" "King"
##  $ author2_first: chr [1:4] "" "" "" "Peter"
##  $ author2_last : chr [1:4] "" "" "" "Straub"
##  $ year         : int [1:4] 1949 1965 1965 2012
##  $ isbn         : chr [1:4] "97804515224935" "0441013597" "9780006546061" "145169721X"
##  $ amazon_price : num [1:4] 7.48 14.79 12.01 22.86
table_html
## # A tibble: 4 x 10
##   pages lang  title  author1_first author1_last author2_first author2_last  year
##   <int> <chr> <chr>  <chr>         <chr>        <chr>         <chr>        <int>
## 1   328 eng   1984   George        Orwell       ""            ""            1949
## 2   704 eng   Dune   Frank         Herbert      ""            ""            1965
## 3   192 eng   Faren~ Ray           Bradbury     ""            ""            1965
## 4   944 eng   The T~ Stephen       King         "Peter"       "Straub"      2012
## # ... with 2 more variables: isbn <chr>, amazon_price <dbl>

All looks good, we are done!

Thank you!