##Intro
In the following code, we create a table of books in HTML, XML, and JSON, with the goal of reading each format into a data frame in R. The packages we need are loaded below:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(XML)
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 4.3.3
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
First, I created the HTML table, stored it on GitHub, and read the page into R.
table_HTML <- read_html("https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/Assignment%207%20HTML%20table")
Second, we load the table into a data frame in R using the following code.
Books_table_HTML <- table_HTML %>%
  html_node("table") %>%
  html_table(header = TRUE, fill = TRUE)
Books_table_HTML
## # A tibble: 3 × 6
## Title Authors `Publication Year` Summary Genre `Main Themes`
## <chr> <chr> <int> <chr> <chr> <chr>
## 1 Guns, Germs, and Steel Jared … 1997 Explor… Non-… Geography, A…
## 2 A People's History of … Howard… 1980 Provid… Non-… Social Justi…
## 3 The Rise and Fall of t… Willia… 1960 Chroni… Non-… World War II…
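The same steps can be tried without a network connection. The following is a minimal sketch, assuming a small inline HTML string (`html_string`, a hypothetical stand-in for the file on GitHub, with only two of the columns) and rvest 1.0 or later, where `html_element()` is the current name for `html_node()`.

```r
library(rvest)

# Hypothetical inline HTML standing in for the page hosted on GitHub
html_string <- "
<table>
  <tr><th>Title</th><th>Publication Year</th></tr>
  <tr><td>Guns, Germs, and Steel</td><td>1997</td></tr>
  <tr><td>The Rise and Fall of the Third Reich</td><td>1960</td></tr>
</table>"

# Parse the string, pick the first <table>, and convert it to a tibble
page <- read_html(html_string)
table_node <- html_element(page, "table")
books_html_demo <- html_table(table_node, header = TRUE)

books_html_demo
```

Because `html_table()` converts column types by default, the year column should come back as a number rather than as text, which matches the `<int>` column in the output above.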
##XML
As in the first part, we store our XML file on GitHub and read it into R.
library(xml2)
library(xmlconvert)
## Warning: package 'xmlconvert' was built under R version 4.3.3
url <- "https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/Books%20.XML"
xmltable <- read_xml(url)
xmltable
## {xml_document}
## <books>
## [1] <book>\n <title>Guns, Germs, and Steel</title>\n <authors>Jared Diamond ...
## [2] <book>\n <title>A People's History of the United States</title>\n <auth ...
## [3] <book>\n <title>The Rise and Fall of the Third Reich</title>\n <authors ...
xml_data <- xmlParse(xmltable)
titles <- xpathSApply(xml_data, "//book/title", xmlValue)
authors <- xpathSApply(xml_data, "//book/authors", xmlValue)
years <- xpathSApply(xml_data, "//book/publication_year", xmlValue)
summaries <- xpathSApply(xml_data, "//book/summary", xmlValue)
main_themes <- xpathSApply(xml_data, "//book/main_themes", xmlValue)
genres <- xpathSApply(xml_data, "//book/genre", xmlValue)
Books_data_XML <- data.frame(
  Title = titles,
  Authors = authors,
  Publication_Year = years,
  Summary = summaries,
  Genre = genres,
  Main_Themes = main_themes)
print(Books_data_XML)
## Title Authors
## 1 Guns, Germs, and Steel Jared Diamond
## 2 A People's History of the United States Howard Zinn
## 3 The Rise and Fall of the Third Reich William L. Shirer, Ron Rosenbaum
## Publication_Year
## 1 1997
## 2 1980
## 3 1960
## Summary
## 1 Explores the reasons for the dominance of Eurasian civilizations throughout history.
## 2 Provides a different perspective on American history, focusing on the experiences of ordinary people rather than political elites.
## 3 Chronicles the history of Nazi Germany, from its rise to power to its defeat in World War II, written by a journalist who witnessed many of the events firsthand.
## Genre Main_Themes
## 1 Non-fiction Geography, Anthropology, Sociology
## 2 Non-fiction Social Justice, Labor Movements, Civil Rights
## 3 Non-fiction World War II, Totalitarianism, Rise and Fall of Empires
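Note that `xmlValue` returns text, so `Publication_Year` above is a character column, unlike the integer column in the HTML version. As an alternative sketch, the same kind of data frame can be built with the xml2 package alone and the year converted explicitly. The inline `xml_string` below is a hypothetical, shortened stand-in for the file on GitHub.

```r
library(xml2)

# Hypothetical inline XML standing in for the file hosted on GitHub
xml_string <- "
<books>
  <book>
    <title>Guns, Germs, and Steel</title>
    <publication_year>1997</publication_year>
  </book>
  <book>
    <title>The Rise and Fall of the Third Reich</title>
    <publication_year>1960</publication_year>
  </book>
</books>"

doc <- read_xml(xml_string)

# One node per <book>; xml_find_first() then returns one child per book
book_nodes <- xml_find_all(doc, "//book")

books_xml_demo <- data.frame(
  Title = xml_text(xml_find_first(book_nodes, "./title")),
  # Convert the year from text to integer so it matches the HTML version
  Publication_Year = as.integer(
    xml_text(xml_find_first(book_nodes, "./publication_year"))
  ),
  stringsAsFactors = FALSE
)

str(books_xml_demo)
```

Looking up each field relative to its `<book>` node keeps the columns aligned row by row, even if one book were missing a field (that cell would become `NA`).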
##JSON
Json_file <- "https://raw.githubusercontent.com/sokkarbishoy/DATA607/main/books.JSON"
json_raw <- fromJSON(Json_file)
Books_table_json <- as.data.frame(json_raw, col.names = c(""))
Books_table_json
## title authors
## 1 Guns, Germs, and Steel Jared Diamond
## 2 A People's History of the United States Howard Zinn
## 3 The Rise and Fall of the Third Reich William L. Shirer, Ron Rosenbaum
## publication_year
## 1 1997
## 2 1980
## 3 1960
## summary
## 1 Explores the reasons for the dominance of Eurasian civilizations throughout history.
## 2 Provides a different perspective on American history, focusing on the experiences of ordinary people rather than political elites.
## 3 Chronicles the history of Nazi Germany, from its rise to power to its defeat in World War II, written by a journalist who witnessed many of the events firsthand.
## genre main_themes
## 1 Non-fiction Geography, Anthropology, Sociology
## 2 Non-fiction Social Justice, Labor Movements, Civil Rights
## 3 Non-fiction World War II, Totalitarianism, Rise and Fall of Empires
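`fromJSON()` simplifies an array of JSON objects into a data frame on its own. The sketch below assumes, purely for illustration, that the books are wrapped in a top-level `"books"` key (the actual layout of the file on GitHub is not shown here), in which case the data frame can be pulled out by name. The inline `json_string` is hypothetical.

```r
library(jsonlite)

# Hypothetical inline JSON; the top-level "books" key is an assumption
json_string <- '{"books": [
  {"title": "Guns, Germs, and Steel", "publication_year": 1997},
  {"title": "The Rise and Fall of the Third Reich", "publication_year": 1960}
]}'

parsed <- fromJSON(json_string)

# The array of objects has already been simplified to a data frame
books_json_demo <- parsed$books

str(books_json_demo)
```

Extracting the element by name keeps the original column names without any prefix, so no renaming step is needed afterwards.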
##Conclusion
In this assignment, I learned how to pull data from HTML, XML, and JSON formats into R. This was likely simpler than usual because I created the tables myself, but over the next few weeks I will practice web scraping more and apply what I learned from this assignment.