Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
library(tidyverse)
library(openintro)
library(dplyr)
library(data.table)
library(XML)
library(rvest)
library(DT)
library(RCurl)
library(jsonlite)
bookshtml <- readLines("https://raw.githubusercontent.com/lburenkov/607week7/main/Book607.html")
bookshtml
## [1] "Book\tAuthors\tSubject\tPublication_date\tLanguage"
## [2] "Why Nations Fail\tDaron Acemoglu, James Robinson\tEconomy\t2012\tEnglish"
## [3] "Pride and Prejudice\tJane Austen\tNovel\t1813\tEnglish"
## [4] "La borra del café\tMario Benedetti\tNovel\t1992\tEspañol"
books.html.url <- getURL("https://raw.githubusercontent.com/lburenkov/607week7/main/Book607.html")
books.html.url
## [1] "Book\tAuthors\tSubject\tPublication_date\tLanguage\r\nWhy Nations Fail\tDaron Acemoglu, James Robinson\tEconomy\t2012\tEnglish\r\nPride and Prejudice\tJane Austen\tNovel\t1813\tEnglish\r\nLa borra del café\tMario Benedetti\tNovel\t1992\tEspañol\r\n"
records2 <- strsplit(books.html.url, "\r\n")[[1]]
records2
## [1] "Book\tAuthors\tSubject\tPublication_date\tLanguage"
## [2] "Why Nations Fail\tDaron Acemoglu, James Robinson\tEconomy\t2012\tEnglish"
## [3] "Pride and Prejudice\tJane Austen\tNovel\t1813\tEnglish"
## [4] "La borra del café\tMario Benedetti\tNovel\t1992\tEspañol"
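Since the downloaded text is tab-separated with a header line, base R's `read.delim()` could probably parse it in a single call. The sketch below is only an illustration: `raw_text` is a hypothetical inline copy of the content printed above (with `\n` line endings instead of `\r\n`) so that it runs without a network connection, and `books_tsv` is a name introduced here.

```r
# Hypothetical inline copy of the tab-separated text shown above,
# using "\n" line endings so the example runs offline.
raw_text <- paste(
  "Book\tAuthors\tSubject\tPublication_date\tLanguage",
  "Why Nations Fail\tDaron Acemoglu, James Robinson\tEconomy\t2012\tEnglish",
  "Pride and Prejudice\tJane Austen\tNovel\t1813\tEnglish",
  "La borra del café\tMario Benedetti\tNovel\t1992\tEspañol",
  sep = "\n"
)

# read.delim() splits on tabs and treats the first line as the header
books_tsv <- read.delim(text = raw_text, stringsAsFactors = FALSE)

# Inspect the column names and types that were inferred
str(books_tsv)
```

Compared with splitting each record by hand, this approach keeps the header line out of the data rows and converts `Publication_date` to a numeric type instead of leaving it as character.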
# Create an empty data frame to store the structured data
structured_data2 <- data.frame(
  Book = character(0),
  Authors = character(0),
  Subject = character(0),
  Publication_date = character(0),
  Language = character(0)
)
# Loop through each record. The header line is not removed first,
# so it ends up as row 1 of the data frame.
for (record in records2) {
  # Split each record using "\t" (tab) as the delimiter
  fields <- unlist(strsplit(record, "\t"))
  # Append the fields as a new row of the data frame
  structured_data2 <- rbind(structured_data2, data.frame(
    Book = fields[1],
    Authors = fields[2],
    Subject = fields[3],
    Publication_date = fields[4],
    Language = fields[5]
  ))
}
# Print the structured data
print(structured_data2)
##                  Book                        Authors Subject Publication_date
## 1                Book                        Authors Subject Publication_date
## 2    Why Nations Fail Daron Acemoglu, James Robinson Economy             2012
## 3 Pride and Prejudice                    Jane Austen   Novel             1813
## 4   La borra del café                Mario Benedetti   Novel             1992
##   Language
## 1 Language
## 2  English
## 3  English
## 4  Español
data.frame(structured_data2)
##                  Book                        Authors Subject Publication_date
## 1                Book                        Authors Subject Publication_date
## 2    Why Nations Fail Daron Acemoglu, James Robinson Economy             2012
## 3 Pride and Prejudice                    Jane Austen   Novel             1813
## 4   La borra del café                Mario Benedetti   Novel             1992
##   Language
## 1 Language
## 2  English
## 3  English
## 4  Español
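The file read above contains tab-separated text rather than HTML markup. For comparison, here is a minimal sketch of how a genuine HTML table could be loaded with `rvest`. The markup in `html_text_example` is a hypothetical hand-written version of the same three books, not the actual content of Book607.html, and `books_html` is a name introduced for this example.

```r
library(rvest)

# Hypothetical hand-written HTML table for the same three books
html_text_example <- '
<html><body>
<table>
  <tr><th>Book</th><th>Authors</th><th>Subject</th><th>Publication_date</th><th>Language</th></tr>
  <tr><td>Why Nations Fail</td><td>Daron Acemoglu, James Robinson</td><td>Economy</td><td>2012</td><td>English</td></tr>
  <tr><td>Pride and Prejudice</td><td>Jane Austen</td><td>Novel</td><td>1813</td><td>English</td></tr>
  <tr><td>La borra del café</td><td>Mario Benedetti</td><td>Novel</td><td>1992</td><td>Español</td></tr>
</table>
</body></html>'

# Parse the markup, select the first <table>, and convert it to a data frame
page <- read_html(html_text_example)
books_html <- as.data.frame(html_table(html_element(page, "table")))

books_html
```

With a real table, the `<th>` cells supply the column names, so no header row has to be removed afterwards.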
books.xml.url <- getURL('https://raw.githubusercontent.com/lburenkov/607week7/main/Book607.xml')
books.xml.url
## [1] "Book\tAuthors\tSubject\tPublication_date\tLanguage\r\nWhy Nations Fail\tDaron Acemoglu, James Robinson\tEconomy\t2012\tEnglish\r\nPride and Prejudice\tJane Austen\tNovel\t1813\tEnglish\r\nLa borra del café\tMario Benedetti\tNovel\t1992\tEspañol\r\n"
records1 <- strsplit(books.xml.url, "\r\n")[[1]]
records1
## [1] "Book\tAuthors\tSubject\tPublication_date\tLanguage"
## [2] "Why Nations Fail\tDaron Acemoglu, James Robinson\tEconomy\t2012\tEnglish"
## [3] "Pride and Prejudice\tJane Austen\tNovel\t1813\tEnglish"
## [4] "La borra del café\tMario Benedetti\tNovel\t1992\tEspañol"
# Create an empty data frame to store the structured data
structured_data1 <- data.frame(
  Book = character(0),
  Authors = character(0),
  Subject = character(0),
  Publication_date = character(0),
  Language = character(0)
)
# Remove the header row
records1 <- records1[-1]
# Loop through each record
for (record in records1) {
  # Split each record using "\t" as a delimiter
  fields <- unlist(strsplit(record, "\t"))
  # Append the fields as a new row of the data frame
  structured_data1 <- rbind(structured_data1, data.frame(
    Book = fields[1],
    Authors = fields[2],
    Subject = fields[3],
    Publication_date = fields[4],
    Language = fields[5]
  ))
}
# Print the structured data
print(structured_data1)
##                  Book                        Authors Subject Publication_date
## 1    Why Nations Fail Daron Acemoglu, James Robinson Economy             2012
## 2 Pride and Prejudice                    Jane Austen   Novel             1813
## 3   La borra del café                Mario Benedetti   Novel             1992
##   Language
## 1  English
## 2  English
## 3  Español
data.frame(structured_data1)
##                  Book                        Authors Subject Publication_date
## 1    Why Nations Fail Daron Acemoglu, James Robinson Economy             2012
## 2 Pride and Prejudice                    Jane Austen   Novel             1813
## 3   La borra del café                Mario Benedetti   Novel             1992
##   Language
## 1  English
## 2  English
## 3  Español
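Book607.xml also returns tab-separated text, which is why the same string splitting worked. If the file held real XML, the `XML` package could parse it directly. The sketch below assumes a hypothetical layout (a `<books>` root with one `<book>` element per title); the element names and the object names `xml_text_example` and `books_xml` are illustrative, not taken from the file.

```r
library(XML)

# Hypothetical hand-written XML for the same three books
xml_text_example <- '<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book>
    <Book>Why Nations Fail</Book>
    <Authors>Daron Acemoglu, James Robinson</Authors>
    <Subject>Economy</Subject>
    <Publication_date>2012</Publication_date>
    <Language>English</Language>
  </book>
  <book>
    <Book>Pride and Prejudice</Book>
    <Authors>Jane Austen</Authors>
    <Subject>Novel</Subject>
    <Publication_date>1813</Publication_date>
    <Language>English</Language>
  </book>
  <book>
    <Book>La borra del café</Book>
    <Authors>Mario Benedetti</Authors>
    <Subject>Novel</Subject>
    <Publication_date>1992</Publication_date>
    <Language>Español</Language>
  </book>
</books>'

# Parse the string as XML, then turn each <book> element into one row
doc <- xmlParse(xml_text_example, asText = TRUE)
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

books_xml
```

`xmlToDataFrame()` returns every column as character, so `Publication_date` would need `as.integer()` before it could match a data frame whose years were read as numbers.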
books.json.url <- getURL("https://raw.githubusercontent.com/lburenkov/607week7/main/books607.json")
books.json.url
## [1] "Json\r\n\"Why Nations Fail\":\"Daron Acemoglu, James Robinson\":\"Economy\":\"2012\":\"English\",\r\n\"Pride and Prejudice\":\"Jane Austen\":\"Novel\":\"1813\":\"English\",\r\n\"La borra del café\":\"Mario Benedetti\":\"Novel\":\"1992\":\"Español\",\r\n"
head(books.json.url)
## [1] "Json\r\n\"Why Nations Fail\":\"Daron Acemoglu, James Robinson\":\"Economy\":\"2012\":\"English\",\r\n\"Pride and Prejudice\":\"Jane Austen\":\"Novel\":\"1813\":\"English\",\r\n\"La borra del café\":\"Mario Benedetti\":\"Novel\":\"1992\":\"Español\",\r\n"
records <- strsplit(books.json.url, "\r\n")[[1]]
records
## [1] "Json"
## [2] "\"Why Nations Fail\":\"Daron Acemoglu, James Robinson\":\"Economy\":\"2012\":\"English\","
## [3] "\"Pride and Prejudice\":\"Jane Austen\":\"Novel\":\"1813\":\"English\","
## [4] "\"La borra del café\":\"Mario Benedetti\":\"Novel\":\"1992\":\"Español\","
# Create an empty data frame to store the structured data
structured_data <- data.frame(
  Title = character(0),
  Author = character(0),
  Genre = character(0),
  Year = character(0),
  Language = character(0)
)
# Loop through each record. The first line ("Json") contains no ":",
# so it produces a row that is NA in every column after Title.
for (record in records) {
  # Split each record using ":" as a delimiter
  # (the surrounding quotes and the trailing comma are kept as-is)
  fields <- unlist(strsplit(record, ":"))
  # Append the fields as a new row of the data frame
  structured_data <- rbind(structured_data, data.frame(
    Title = fields[1],
    Author = fields[2],
    Genre = fields[3],
    Year = fields[4],
    Language = fields[5]
  ))
}
# Print the structured data
print(structured_data)
##                   Title                           Author     Genre   Year
## 1                  Json                             <NA>      <NA>   <NA>
## 2    "Why Nations Fail" "Daron Acemoglu, James Robinson" "Economy" "2012"
## 3 "Pride and Prejudice"                    "Jane Austen"   "Novel" "1813"
## 4   "La borra del café"                "Mario Benedetti"   "Novel" "1992"
##     Language
## 1       <NA>
## 2 "English",
## 3 "English",
## 4 "Español",
data.frame(structured_data)
##                   Title                           Author     Genre   Year
## 1                  Json                             <NA>      <NA>   <NA>
## 2    "Why Nations Fail" "Daron Acemoglu, James Robinson" "Economy" "2012"
## 3 "Pride and Prejudice"                    "Jane Austen"   "Novel" "1813"
## 4   "La borra del café"                "Mario Benedetti"   "Novel" "1992"
##     Language
## 1       <NA>
## 2 "English",
## 3 "English",
## 4 "Español",
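The content of books607.json is not valid JSON, which is why it had to be split by hand. As a sketch only, here is what a valid file might look like and how `jsonlite::fromJSON()` would load it. The structure in `json_text_example` (an array of objects, with `Authors` stored as an array) is a hypothetical layout, and `books_json` and `books_json_flat` are names introduced for this example.

```r
library(jsonlite)

# Hypothetical valid JSON for the same three books
json_text_example <- '[
  {"Book": "Why Nations Fail",
   "Authors": ["Daron Acemoglu", "James Robinson"],
   "Subject": "Economy", "Publication_date": 2012, "Language": "English"},
  {"Book": "Pride and Prejudice",
   "Authors": ["Jane Austen"],
   "Subject": "Novel", "Publication_date": 1813, "Language": "English"},
  {"Book": "La borra del café",
   "Authors": ["Mario Benedetti"],
   "Subject": "Novel", "Publication_date": 1992, "Language": "Español"}
]'

# An array of objects is simplified to a data frame;
# Authors becomes a list column because the arrays differ in length
books_json <- fromJSON(json_text_example)
str(books_json)

# Collapse each author vector into one comma-separated string
books_json_flat <- books_json
books_json_flat$Authors <- vapply(
  books_json$Authors, paste, character(1), collapse = ", "
)

books_json_flat
```

Storing the authors as an array keeps the two authors of the first book separate in the file, and collapsing them afterwards gives a plain character column that is easier to compare with the other formats.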
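The process for reading each file was quite different, and the three data frames are not identical. The data frame built from Book607.html keeps the header line as its first row, the one built from Book607.xml drops it, and the one built from books607.json uses different column names, starts with a row of NA values produced by the "Json" line, and still carries the quotation marks and trailing commas. I think these differences come from how the files were read and how I worked with them rather than from the book data itself. Some files clearly need more cleanup than others, which I believe is the main point of this exercise.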
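The question of whether the data frames are identical can also be checked with code rather than by eye, using base R's `identical()` and `all.equal()`. The sketch below uses two small hypothetical data frames, `df_a` and `df_b`, that mimic the kinds of differences seen here (column names and column types); they are stand-ins, not the objects created above.

```r
# Two hypothetical data frames holding the same books,
# but with different column names and a different type for the year
df_a <- data.frame(
  Book = c("Why Nations Fail", "Pride and Prejudice"),
  Publication_date = c("2012", "1813"),
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  Title = c("Why Nations Fail", "Pride and Prejudice"),
  Year = c(2012L, 1813L),
  stringsAsFactors = FALSE
)

# Strict comparison: names, types, and values must all match
same_before <- identical(df_a, df_b)

# all.equal() reports what differs instead of returning only TRUE or FALSE
differences <- all.equal(df_a, df_b)

# Harmonize the column names and types, then compare again
df_b_fixed <- df_b
names(df_b_fixed) <- names(df_a)
df_b_fixed$Publication_date <- as.character(df_b_fixed$Publication_date)
same_after <- identical(df_a, df_b_fixed)

print(same_before)
print(differences)
print(same_after)
```

The same pattern could be applied to the three data frames in this report once their headers, column names, and stray quotation marks have been cleaned up.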