The objective of this assignment was to pick three of my favorite books on one of my favorite subjects, at least one of which had to have more than one author. For each book, I needed to record the title, author(s), and two or three other attributes that I found interesting. I then had to store that information in three separate files, one each in HTML (using an HTML table), XML, and JSON format (e.g. “books.html”, “books.xml”, and “books.json”). Finally, I had to write R code, using packages of my choice, to load each of the three files into its own R data frame and note whether the three data frames are identical.
To begin, I chose three books on data science ethics that I found interesting. For each I recorded the title, the author(s), a link to where the book can be read or purchased, and a link to a relevant YouTube talk about the book. The information for each book is below.
Title: Data Feminism
Author(s): Catherine D’Ignazio and Lauren F. Klein
Link to book: https://data-feminism.mitpress.mit.edu/
Link to relevant youtube talk: https://www.youtube.com/watch?v=guIxU_hK4aY&ab_channel=LSE
Title: Artificial Unintelligence
Author: Meredith Broussard
Link to book: https://mitpress.mit.edu/9780262537018/artificial-unintelligence/
Link to relevant youtube talk: https://www.youtube.com/watch?v=3PIpCD_hO-g&ab_channel=Triangulation
Title: Algorithms of Oppression
Author: Safiya Umoja Noble
Link to book: https://nyupress.org/9781479837243/algorithms-of-oppression/
Link to relevant youtube talk: https://www.youtube.com/watch?v=UXuJ8yQf6dI&ab_channel=TEDxTalks
To start, I will load the required packages. The rvest library helps read data from an HTML file into R. The jsonlite library helps read data from a JSON file into R. The XML and methods libraries help read data from an XML file into R.
library(kableExtra)
library(rvest)
library(tidyverse)
library("XML")
library("methods")
library(jsonlite)
library(httr)
Use rvest functions to read the HTML table into an R data frame.
github_link <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_hw7/main/books.html"
temp_file <- tempfile(fileext = ".html")
req <- GET(github_link,
           # write result to disk
           write_disk(path = temp_file))
html <- read_html(temp_file)
df_html <- html |>
  html_element("table") |>
  html_table()
df_html <- df_html |>
  rename("title" = "Name",
         "author_s" = "Author(s)- (and separated)",
         "book_link" = "Book Link",
         "youtube_link" = "Youtube Link")
kable(head(df_html)) |>
  kable_styling("striped")
id | title | author_s | book_link | youtube_link |
---|---|---|---|---|
1 | Data Feminism | Catherine D’Ignazio and Lauren F. Klein | https://data-feminism.mitpress.mit.edu/ | https://www.youtube.com/watch?v=guIxU_hK4aY&ab_channel=LSE |
2 | Artificial Unintelligence | Meredith Broussard | https://mitpress.mit.edu/9780262537018/artificial-unintelligence// | https://www.youtube.com/watch?v=3PIpCD_hO-g&ab_channel=Triangulation |
3 | Algorithms of Oppression | Safiya Umoja Noble | https://nyupress.org/9781479837243/algorithms-of-oppression/ | https://www.youtube.com/watch?v=UXuJ8yQf6dI&ab_channel=TEDxTalks |
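The HTML step can be tried without downloading anything. The sketch below is only an illustration: the inline table is a hypothetical, cut-down stand-in for books.html (two columns, two rows), not the actual file. It uses rvest's `minimal_html()` to build a small document from a string and then applies the same `html_element()` and `html_table()` calls as above.

```r
library(rvest)

# Hypothetical stand-in for books.html: a small table held in a string.
page <- minimal_html('
  <table>
    <tr><th>id</th><th>Name</th></tr>
    <tr><td>1</td><td>Data Feminism</td></tr>
    <tr><td>2</td><td>Artificial Unintelligence</td></tr>
  </table>')

# Same extraction steps as the real workflow: find the table, convert it.
books_html <- page |>
  html_element("table") |>
  html_table()

print(books_html)
```

By default `html_table()` tries to guess column types, which would explain why the id column from the HTML file needed no explicit `as.integer()` call.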
Use jsonlite and tidyr functions to read the JSON file into an R data frame.
github_link <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_hw7/main/books.json"
temp_file <- tempfile(fileext = ".json")
req <- GET(github_link,
           # write result to disk
           write_disk(path = temp_file))
json <- fromJSON(txt = temp_file)
df <- tibble(results = json$`Kayleahs Books`)
df_json <- df |>
  unnest_wider(results) |>
  as.data.frame()
df_json <- df_json |>
  rename("title" = "Name",
         "author_s" = "Author(s)- (and separated)",
         "book_link" = "Book Link",
         "youtube_link" = "Youtube Link")
df_json$id <- as.integer(df_json$id)
kable(head(df_json)) |>
  kable_styling("striped")
id | title | author_s | book_link | youtube_link |
---|---|---|---|---|
1 | Data Feminism | Catherine D’Ignazio and Lauren F. Klein | https://data-feminism.mitpress.mit.edu/ | https://www.youtube.com/watch?v=guIxU_hK4aY&ab_channel=LSE |
2 | Artificial Unintelligence | Meredith Broussard | https://mitpress.mit.edu/9780262537018/artificial-unintelligence// | https://www.youtube.com/watch?v=3PIpCD_hO-g&ab_channel=Triangulation |
3 | Algorithms of Oppression | Safiya Umoja Noble | https://nyupress.org/9781479837243/algorithms-of-oppression/ | https://www.youtube.com/watch?v=UXuJ8yQf6dI&ab_channel=TEDxTalks |
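A smaller version of the JSON step is sketched below. The JSON string is a hypothetical stand-in, not the contents of books.json: it assumes a top-level key (here called `books`) holding an array of objects, with id stored as a string. Under that assumption `fromJSON()` simplifies the array straight into a data frame, and the id column then needs the same integer conversion as above.

```r
library(jsonlite)

# Hypothetical stand-in for books.json; the key name "books" and the
# string-valued ids are assumptions made for this illustration.
json_text <- '{
  "books": [
    {"id": "1", "Name": "Data Feminism"},
    {"id": "2", "Name": "Artificial Unintelligence"}
  ]
}'

parsed <- fromJSON(json_text)

# An array of flat objects is simplified into a data frame by default.
books_json <- parsed$books

# ids arrive as character because they are quoted in the JSON text.
books_json$id <- as.integer(books_json$id)

print(books_json)
```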
Use the XML package to read the XML file into an R data frame.
github_link <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_hw7/main/books.xml"
temp_file <- tempfile(fileext = ".xml")
req <- GET(github_link,
           # write result to disk
           write_disk(path = temp_file))
df_xml <- xmlToDataFrame(temp_file)
df_xml <- df_xml |>
  rename("id" = "ID",
         "title" = "TITLE",
         "author_s" = "AUTHOR_S",
         "book_link" = "BOOK_LINK",
         "youtube_link" = "YOUTUBE")
df_xml$id <- as.integer(df_xml$id)
kable(head(df_xml)) |>
  kable_styling("striped")
id | title | author_s | book_link | youtube_link |
---|---|---|---|---|
1 | Data Feminism | Catherine D’Ignazio and Lauren F. Klein | https://data-feminism.mitpress.mit.edu/ | https://www.youtube.com/watch?v=guIxU_hK4aY&ab_channel=LSE |
2 | Artificial Unintelligence | Meredith Broussard | https://mitpress.mit.edu/9780262537018/artificial-unintelligence// | https://www.youtube.com/watch?v=3PIpCD_hO-g&ab_channel=Triangulation |
3 | Algorithms of Oppression | Safiya Umoja Noble | https://nyupress.org/9781479837243/algorithms-of-oppression/ | https://www.youtube.com/watch?v=UXuJ8yQf6dI&ab_channel=TEDxTalks |
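The XML step can also be illustrated with an inline document. The element names `BOOKS` and `BOOK` below are assumptions for the sketch (only the child names such as `ID` and `TITLE` appear in the code above), and the content is cut down to two fields. `xmlToDataFrame()` reads every value as text, which is why the id column has to be converted afterwards.

```r
library(XML)

# Hypothetical stand-in for books.xml; the wrapper element names are assumed.
xml_text <- '<BOOKS>
  <BOOK><ID>1</ID><TITLE>Data Feminism</TITLE></BOOK>
  <BOOK><ID>2</ID><TITLE>Artificial Unintelligence</TITLE></BOOK>
</BOOKS>'

# Parse from a string rather than a file.
doc <- xmlParse(xml_text, asText = TRUE)

# stringsAsFactors = FALSE keeps columns as character, so that
# as.integer() converts the text "1", "2" rather than factor codes.
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
books_xml$ID <- as.integer(books_xml$ID)

print(books_xml)
```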
The objective of this assignment was met: books.html, books.json, and books.xml files were created, each holding the title, author(s), a link to read or purchase the book, and a link to a YouTube talk. Once the column names and the type of the id column are aligned, the three data frames show the same contents, regardless of the starting file format.
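Whether two data frames match can also be checked in code rather than by eye. The sketch below uses base R's `identical()` on two small, hypothetical data frames (not the actual df_html, df_json, or df_xml) to show how a difference in column type alone makes the check fail, and how it passes once the types are aligned.

```r
# Two hypothetical data frames with the same values but different id types.
a <- data.frame(
  id = c(1L, 2L),
  title = c("Data Feminism", "Artificial Unintelligence"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  id = c("1", "2"),
  title = c("Data Feminism", "Artificial Unintelligence"),
  stringsAsFactors = FALSE
)

# Character ids versus integer ids: the strict comparison fails.
same_before <- identical(a, b)

# After aligning the type, the strict comparison succeeds.
b$id <- as.integer(b$id)
same_after <- identical(a, b)

print(c(before = same_before, after = same_after))
```

`identical()` also compares classes, so a tibble (which is what `html_table()` returns) and a plain data frame with the same values would likely need `as.data.frame()` applied first. `all.equal()` is a more tolerant alternative that reports what differs.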
To build on this work, I would consider handling the author list differently. Currently, to separate the authors into individual elements I could split on the word “and” with a regex, because that is how I stored multiple authors. Alternatively, I could have adapted the HTML, XML, and JSON files to store the authors in a list-like structure and saved that list as an element of the data frame.
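A minimal sketch of the first option, splitting the stored string on “and”, is shown below in base R. The data frame is a hypothetical two-row stand-in and the column name `authors` is introduced only for illustration. It uses a fixed-string split on " and " (with surrounding spaces) instead of a regex, and stores the result as a list column.

```r
# Hypothetical stand-in for one of the book data frames.
books <- data.frame(
  title = c("Data Feminism", "Artificial Unintelligence"),
  author_s = c("Catherine D'Ignazio and Lauren F. Klein", "Meredith Broussard"),
  stringsAsFactors = FALSE
)

# Split each author string on the literal " and "; the result is a list
# with one character vector per book, stored as a list column.
books$authors <- strsplit(books$author_s, " and ", fixed = TRUE)

# Number of authors per book.
n_authors <- lengths(books$authors)

print(books$authors)
print(n_authors)
```

This approach would mis-split any name that itself contains " and ", which is one reason storing authors as a proper list in the source files may be the more robust choice.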