I picked three books that I own in hard copy, each of which I have only partly read. Two are about data science (The Language of SQL and Data Science for Business), and the third, Teach Like a Champion, teaches classroom-management strategies for teachers.
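library(RCurl)     # getURL()
library(rvest)     # read_html(), html_table()
library(XML)       # xmlParse(), xmlRoot(), xmlToDataFrame()
library(jsonlite)  # fromJSON()
library(dplyr)     # %>%, mutate()
library(stringr)   # str_c(), str_replace()
url <- getURL('https://raw.githubusercontent.com/SalouaDaouki/Data607/main/Books.html')
books_HTML <- url %>%
  read_html(encoding = 'UTF-8') %>%
  html_table(header = NA, trim = TRUE) %>%
  .[[1]]
books_HTML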
## # A tibble: 3 × 6
## Title Author Edition Year Publi…¹ ISBN
## <chr> <chr> <chr> <int> <chr> <chr>
## 1 The Language of SQL Larry Rockoff, Mark Tab… 2nd 2017 Pearso… 978-…
## 2 Data Science for Business Foster Provost, Tom Faw… 1st 2013 0'Reil… 978-…
## 3 Teach Like a Champion 3.0 Doug Lemov, Paul McCart… 3rd 2021 A Wile… 9781…
## # … with abbreviated variable name ¹`Published by`
str(books_HTML)
## tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
## $ Title : chr [1:3] "The Language of SQL" "Data Science for Business" "Teach Like a Champion 3.0"
## $ Author : chr [1:3] "Larry Rockoff, Mark Taber" "Foster Provost, Tom Fawcett" "Doug Lemov, Paul McCarthy"
## $ Edition : chr [1:3] "2nd" "1st" "3rd"
## $ Year : int [1:3] 2017 2013 2021
## $ Published by: chr [1:3] "Pearson Education, Inc." "0'Reilly Media, Inc" "A Wiley Imprint"
## $ ISBN : chr [1:3] "978-0-13-465825-4" "978-1-449-36132-7" "9781119712619"
url <- getURL('https://raw.githubusercontent.com/SalouaDaouki/Data607/main/Books_data.xml')
books_XML <- url %>%
  xmlParse() %>%
  xmlRoot() %>%
  xmlToDataFrame(stringsAsFactors = FALSE)
books_XML
## Title Author Edition Year Publisher
## 1 The language of SQL 2nd 2017 Pearson Education,inc.
## 2 Data Science for Business 1st 2013 O'Reilly Media,inc.
## 3 Teach Like a Champion 3.0 3rd 2021 A Wiley Imprint
## ISBN
## 1 978-0-13-465825-4
## 2 978-1-449-36132-7
## 3 9781119712619
library(xml2)
# The authors are stored as attributes of each <Author> node, so
# xmlToDataFrame() left that column empty; extract them with xml2.
Author_attr <- url %>%
  read_xml() %>%
  xml_find_all('//Author') %>%  # replaces the deprecated xml_nodes()
  xml_attrs() %>%
  lapply(function(x) str_c(x, collapse = ', ')) %>%
  unlist()
Author_attr
## [1] "Larry Rockoff, Mark Taber" "Foster Provost, Tom Fawcett"
## [3] "Doug Lemov, Paul McCarthy"
books_XML <- books_XML %>% mutate(Author = Author_attr)
books_XML
## Title Author Edition Year
## 1 The language of SQL Larry Rockoff, Mark Taber 2nd 2017
## 2 Data Science for Business Foster Provost, Tom Fawcett 1st 2013
## 3 Teach Like a Champion 3.0 Doug Lemov, Paul McCarthy 3rd 2021
## Publisher ISBN
## 1 Pearson Education,inc. 978-0-13-465825-4
## 2 O'Reilly Media,inc. 978-1-449-36132-7
## 3 A Wiley Imprint 9781119712619
str(books_XML)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "The language of SQL" "Data Science for Business" "Teach Like a Champion 3.0"
## $ Author : chr "Larry Rockoff, Mark Taber" "Foster Provost, Tom Fawcett" "Doug Lemov, Paul McCarthy"
## $ Edition : chr "2nd" "1st" "3rd"
## $ Year : chr "2017" "2013" "2021"
## $ Publisher: chr "Pearson Education,inc." "O'Reilly Media,inc." "A Wiley Imprint"
## $ ISBN : chr "978-0-13-465825-4" "978-1-449-36132-7" "9781119712619"
url <- getURL("https://raw.githubusercontent.com/SalouaDaouki/Data607/main/Books.json")
books_JSON <- url %>%
  fromJSON() %>%
  as.data.frame() %>%
  # Strip the 'Books.data.' prefix from the column names
  rename_with(~ str_replace(.x, 'Books\\.data\\.', '')) %>%
  # Collapse each book's authors into one comma-separated string
  mutate(Author = unlist(lapply(Author, function(x) str_c(x, collapse = ', '))))
books_JSON
## Title Author Edition Year
## 1 The language of SQL Larry Rockoff, Mark Taber 2nd 2017
## 2 Data Science for Business Foster Provost, Tom Fawcett 1st 2013
## 3 Teach Like a Champion 3.0 Doug Lemov, Paul McCarthy 3rd 2021
## Publisher ISBN
## 1 Pearson Education,inc. 978-0-13-465825-4
## 2 O'Reilly Media,inc. 978-1-449-36132-7
## 3 A Wiley Imprint 9781119712619
str(books_JSON)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "The language of SQL" "Data Science for Business" "Teach Like a Champion 3.0"
## $ Author : chr "Larry Rockoff, Mark Taber" "Foster Provost, Tom Fawcett" "Doug Lemov, Paul McCarthy"
## $ Edition : chr "2nd" "1st" "3rd"
## $ Year : int 2017 2013 2021
## $ Publisher: chr "Pearson Education,inc." "O'Reilly Media,inc." "A Wiley Imprint"
## $ ISBN : chr "978-0-13-465825-4" "978-1-449-36132-7" "9781119712619"
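The Author step in the JSON pipeline can also be written in base R. This is only a minimal sketch, assuming fromJSON() returns Author as a list with one character vector per book; authors_demo is a hypothetical stand-in built from the names printed above, not the real data.

```r
# Hypothetical stand-in for the Author column as a list of character vectors
authors_demo <- list(
  c("Larry Rockoff", "Mark Taber"),
  c("Foster Provost", "Tom Fawcett"),
  c("Doug Lemov", "Paul McCarthy")
)

# vapply() guarantees exactly one string per book, unlike unlist(lapply(...))
authors_flat <- vapply(authors_demo, paste, character(1), collapse = ", ")
print(authors_flat)
```

I think vapply() is a little safer here because it stops with an error if any book does not collapse to exactly one string.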
All three frames are the same size: 3 observations of 6 variables.
However, str() shows that “Year” is loaded as an integer in the HTML and JSON frames but as a character in the XML frame.
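The type mismatch in “Year” is easy to remove after loading. Here is a minimal sketch in base R; books_xml_demo is a hypothetical stand-in that copies a few values from the XML output above, so it runs without downloading anything.

```r
# Hypothetical stand-in for books_XML: Year arrives as character
books_xml_demo <- data.frame(
  Title = c("The language of SQL",
            "Data Science for Business",
            "Teach Like a Champion 3.0"),
  Year  = c("2017", "2013", "2021"),
  stringsAsFactors = FALSE
)

# Convert Year so it matches the integer type of the HTML and JSON frames
books_xml_demo$Year <- as.integer(books_xml_demo$Year)
print(class(books_xml_demo$Year))
```

The same one-line conversion should work on the real books_XML, assuming every Year value is a clean number.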
To read each file I used essentially the same code (getURL() with the file’s URL, followed by a parser). The only difference is that html_table() already returns a table, whereas for XML and JSON I had to convert the parsed result into a data frame myself.
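After loading into R, the three tables hold the same three books and six fields, though not character for character: the HTML table names its publisher column “Published by” and spells a few title and publisher strings slightly differently from the XML and JSON files.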
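How similar the frames really are can be checked in code instead of by eye. This is a rough sketch in base R; html_demo and xml_demo are hypothetical stand-ins that copy two rows from the printed output, not the real books_HTML and books_XML.

```r
# Hypothetical stand-ins built from values shown in the output above
html_demo <- data.frame(
  Title = c("The Language of SQL", "Data Science for Business"),
  Year  = c(2017L, 2013L),
  stringsAsFactors = FALSE
)
xml_demo <- data.frame(
  Title = c("The language of SQL", "Data Science for Business"),
  Year  = c("2017", "2013"),
  stringsAsFactors = FALSE
)

# Harmonize the Year type before comparing
xml_demo$Year <- as.integer(xml_demo$Year)

same_shape  <- identical(dim(html_demo), dim(xml_demo))
same_years  <- identical(html_demo$Year, xml_demo$Year)
same_titles <- identical(html_demo$Title, xml_demo$Title)
# Compare titles again, ignoring capitalization
same_titles_ignoring_case <- identical(tolower(html_demo$Title),
                                       tolower(xml_demo$Title))
```

Checking column by column like this shows where two frames differ, which a single identical() call on the whole frame would not tell me.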
All my group members for Project 3 said that this assignment was very simple, which made me feel voiceless. It has taken me a very long time to figure things out in every single assignment since the beginning of the semester. Finally, and more importantly, I did it; I think?!