Assignment – Working with XML and JSON in R

Loading the necessary packages:
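
The chunk that loads the packages is not shown in this write-up. Judging only from the functions called in the sections below (getURL, read_html, html_table, xmlParse, read_xml, fromJSON, mutate, str_c), the set sketched here is a reasonable guess; the exact list, and jsonlite in particular as the source of fromJSON, is an assumption rather than the original code.

```r
# Packages inferred from the functions used later in this report (an assumption)
pkgs <- c("RCurl", "rvest", "XML", "xml2", "jsonlite", "dplyr", "stringr")

# Check which ones are installed; requireNamespace() loads but does not attach
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Anything reported as not installed can be added with install.packages() before running the rest of the report.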

1. Introduction:

I picked three books that I currently own in hard copy and have only partly read: two are on data science (The Language of SQL and Data Science for Business), and the third teaches strategies for classroom management as a teacher (Teach Like a Champion).

2. Loading the books data as HTML:

url <- getURL('https://raw.githubusercontent.com/SalouaDaouki/Data607/main/Books.html')
books_HTML <- url %>%
  read_html(encoding = 'UTF-8') %>%
  html_table(header = NA, trim = TRUE) %>%
  .[[1]]

books_HTML

## # A tibble: 3 × 6
##   Title                     Author                   Edition  Year Publi…¹ ISBN 
##   <chr>                     <chr>                    <chr>   <int> <chr>   <chr>
## 1 The Language of SQL       Larry Rockoff, Mark Tab… 2nd      2017 Pearso… 978-…
## 2 Data Science for Business Foster Provost, Tom Faw… 1st      2013 0'Reil… 978-…
## 3 Teach Like a Champion 3.0 Doug Lemov, Paul McCart… 3rd      2021 A Wile… 9781…
## # … with abbreviated variable name ¹`Published by`

str(books_HTML)

## tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Title       : chr [1:3] "The Language of SQL" "Data Science for Business" "Teach Like a Champion 3.0"
##  $ Author      : chr [1:3] "Larry Rockoff, Mark Taber" "Foster Provost, Tom Fawcett" "Doug Lemov, Paul McCarthy"
##  $ Edition     : chr [1:3] "2nd" "1st" "3rd"
##  $ Year        : int [1:3] 2017 2013 2021
##  $ Published by: chr [1:3] "Pearson Education, Inc." "0'Reilly Media, Inc" "A Wiley Imprint"
##  $ ISBN        : chr [1:3] "978-0-13-465825-4" "978-1-449-36132-7" "9781119712619"

3. Loading the books data as XML:

url <- getURL('https://raw.githubusercontent.com/SalouaDaouki/Data607/main/Books_data.xml')
books_XML <- url %>%
  xmlParse() %>%
  xmlRoot() %>%
  xmlToDataFrame(stringsAsFactors = FALSE)
books_XML

##                       Title Author Edition Year              Publisher
## 1       The language of SQL            2nd 2017 Pearson Education,inc.
## 2 Data Science for Business            1st 2013    O'Reilly Media,inc.
## 3 Teach Like a Champion 3.0            3rd 2021        A Wiley Imprint
##                ISBN
## 1 978-0-13-465825-4
## 2 978-1-449-36132-7
## 3     9781119712619

library(xml2)
# The author names are stored as attributes of each <Author> node, which is
# why xmlToDataFrame() left the Author column empty. Collect the attributes
# and join each book's authors into a single string.
Author_attr <- url %>%
  read_xml() %>%
  xml_find_all('//Author') %>%
  xml_attrs() %>%
  lapply(function(x) str_c(x, collapse=', ')) %>%
  unlist()

Author_attr

## [1] "Larry Rockoff, Mark Taber"   "Foster Provost, Tom Fawcett"
## [3] "Doug Lemov, Paul McCarthy"

books_XML <- books_XML %>% mutate(Author = Author_attr)

books_XML

##                       Title                      Author Edition Year
## 1       The language of SQL   Larry Rockoff, Mark Taber     2nd 2017
## 2 Data Science for Business Foster Provost, Tom Fawcett     1st 2013
## 3 Teach Like a Champion 3.0   Doug Lemov, Paul McCarthy     3rd 2021
##                Publisher              ISBN
## 1 Pearson Education,inc. 978-0-13-465825-4
## 2    O'Reilly Media,inc. 978-1-449-36132-7
## 3        A Wiley Imprint     9781119712619

str(books_XML)

## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "The language of SQL" "Data Science for Business" "Teach Like a Champion 3.0"
##  $ Author   : chr  "Larry Rockoff, Mark Taber" "Foster Provost, Tom Fawcett" "Doug Lemov, Paul McCarthy"
##  $ Edition  : chr  "2nd" "1st" "3rd"
##  $ Year     : chr  "2017" "2013" "2021"
##  $ Publisher: chr  "Pearson Education,inc." "O'Reilly Media,inc." "A Wiley Imprint"
##  $ ISBN     : chr  "978-0-13-465825-4" "978-1-449-36132-7" "9781119712619"
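
The empty Author column in the first XML printout comes from where the names live: as attributes of the Author element rather than as its text. A minimal sketch of that situation, using a small inline document, might look like this. The attribute names `first` and `second` and the overall layout are hypothetical, since the actual file is not shown here.

```r
library(xml2)

# Hypothetical stand-in for one record of the books file
xml_txt <- '<Books>
  <Book>
    <Title>The language of SQL</Title>
    <Author first="Larry Rockoff" second="Mark Taber"/>
  </Book>
</Books>'

doc <- read_xml(xml_txt)
author_nodes <- xml_find_all(doc, "//Author")

# The element has no text content, so a text-based conversion yields ""
author_text <- xml_text(author_nodes)

# The names are in the attributes: one named character vector per node
author_attrs <- xml_attrs(author_nodes)
authors <- vapply(author_attrs, paste, character(1), collapse = ", ")
authors
```

Here `author_text` holds an empty string while `authors` holds the joined names, which mirrors the before and after versions of books_XML above.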

4. Loading the books data as JSON:

url <- getURL("https://raw.githubusercontent.com/SalouaDaouki/Data607/main/Books.json")
books_JSON <- url %>%
  fromJSON() %>%
  as.data.frame() %>%
  # Drop the 'Books.data.' prefix that as.data.frame() adds to the column names
  rename_with(~ str_replace(.x, 'Books\\.data\\.', '')) %>%
  # Each book's authors arrive as a vector; join them into a single string
  mutate(Author = unlist(lapply(Author, function(x) str_c(x, collapse = ', '))))

books_JSON

##                       Title                      Author Edition Year
## 1       The language of SQL   Larry Rockoff, Mark Taber     2nd 2017
## 2 Data Science for Business Foster Provost, Tom Fawcett     1st 2013
## 3 Teach Like a Champion 3.0   Doug Lemov, Paul McCarthy     3rd 2021
##                Publisher              ISBN
## 1 Pearson Education,inc. 978-0-13-465825-4
## 2    O'Reilly Media,inc. 978-1-449-36132-7
## 3        A Wiley Imprint     9781119712619

str(books_JSON)

## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "The language of SQL" "Data Science for Business" "Teach Like a Champion 3.0"
##  $ Author   : chr  "Larry Rockoff, Mark Taber" "Foster Provost, Tom Fawcett" "Doug Lemov, Paul McCarthy"
##  $ Edition  : chr  "2nd" "1st" "3rd"
##  $ Year     : int  2017 2013 2021
##  $ Publisher: chr  "Pearson Education,inc." "O'Reilly Media,inc." "A Wiley Imprint"
##  $ ISBN     : chr  "978-0-13-465825-4" "978-1-449-36132-7" "9781119712619"
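
The Author step in the JSON pipeline is needed because each book's authors form a JSON array, which becomes a list column after parsing. The sketch below reproduces this on a small inline string, assuming fromJSON comes from jsonlite. The nesting under `Books` and `data` is inferred from the `Books.data.` prefix removed above, and the record layout is otherwise an assumption.

```r
library(jsonlite)

# Hypothetical stand-in for the books file, trimmed to two records
json_txt <- '{"Books": {"data": [
  {"Title": "The language of SQL",
   "Author": ["Larry Rockoff", "Mark Taber"], "Year": 2017},
  {"Title": "Data Science for Business",
   "Author": ["Foster Provost", "Tom Fawcett"], "Year": 2013}
]}}'

# Pull the nested data frame out directly instead of flattening the outer list
books <- fromJSON(json_txt)$Books$data

# Author is a list column: one character vector per book
author_is_list <- is.list(books$Author)

# Collapse each vector into a single comma-separated string
books$Author <- vapply(books$Author, paste, character(1), collapse = ", ")
str(books)
```

Indexing with `$Books$data` avoids the prefixed column names altogether, so no renaming step is needed in this variant.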

5. Are the three data frames identical?

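All three frames have the same size: 3 observations of 6 variables. However:

Based on str(), the variable “Year” is loaded as integer in the HTML and JSON frames, while in the XML frame it is loaded as character.
The HTML frame names its fifth column “Published by”, while the XML and JSON frames name it “Publisher”. Some values also differ in capitalization and punctuation, for example “The Language of SQL” in the HTML frame versus “The language of SQL” in the other two.
To read all frames I used, in general, the same code (getURL followed by the path), except that for HTML I didn’t need to convert the table into a data frame as I did for XML and JSON.
After reading the frames into R, the XML and JSON tables printed identically, and the HTML table holds the same three books with the differences noted above.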
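
One way to make this comparison concrete is to check the frames column by column after harmonising names and types. The sketch below uses only base R and small stand-in frames whose values are copied from the printed output above (three columns only, to keep it short). The names `html_df` and `xml_df` are placeholders, not the objects created earlier.

```r
# Stand-ins for the HTML and XML frames, values taken from the printed output
html_df <- data.frame(
  Title = c("The Language of SQL", "Data Science for Business",
            "Teach Like a Champion 3.0"),
  Year = c(2017L, 2013L, 2021L),
  `Published by` = c("Pearson Education, Inc.", "0'Reilly Media, Inc",
                     "A Wiley Imprint"),
  check.names = FALSE, stringsAsFactors = FALSE
)
xml_df <- data.frame(
  Title = c("The language of SQL", "Data Science for Business",
            "Teach Like a Champion 3.0"),
  Year = c("2017", "2013", "2021"),
  Publisher = c("Pearson Education,inc.", "O'Reilly Media,inc.",
                "A Wiley Imprint"),
  stringsAsFactors = FALSE
)

# Same shape, but not yet comparable: a column name and the Year type differ
same_shape <- identical(dim(html_df), dim(xml_df))

# Harmonise the column name and convert Year to integer
names(html_df)[names(html_df) == "Published by"] <- "Publisher"
xml_df$Year <- as.integer(xml_df$Year)

# Compare each column separately to see where the frames still disagree
same_by_column <- vapply(
  names(xml_df),
  function(col) identical(html_df[[col]], xml_df[[col]]),
  logical(1)
)
same_by_column
```

With these stand-ins, Year matches once converted, while Title and Publisher still differ because of the capitalization and punctuation in the source files.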

6. Conclusion:

All my group members for Project 3 said that this assignment was very simple, which made me feel voiceless. It has taken me a very long time to figure things out in every single assignment since the beginning of the semester. Finally, and more importantly, I did it; I think?!