GitHub Link: https://github.com/Peter-Thompson1992/Data607/blob/main/Assignment7.Rmd
RPubs Link:
## Instructions

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
Sources:
https://rvest.tidyverse.org/
https://www.freecodecamp.org/news/introduction-to-html/1
https://community.splunk.com/t5/Splunk-Search/How-to-create-a-table-from-JSON/m-p/642198
https://stackoverflow.com/questions/5863304/how-should-i-represent-tabular-data-in-json
https://tomizonor.wordpress.com/2013/03/26/from-html-pages/ (this one didn’t seem to work for me)
I thought it was easiest to first put all of the data into a CSV/Excel file so I would know exactly what I expected to see. That way I could more easily check that the information in the three files was accurate before finishing the rest of the assignment.
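As a small illustration (the file name books.csv is hypothetical; the actual spreadsheet may be named differently), that reference file could be read in up front and used to sanity-check the other imports:

# Hypothetical reference file created by hand in Excel and exported to CSV
reference_df <- read.csv("books.csv", check.names = FALSE)
str(reference_df)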
First, we will import the file in HTML format. The rvest tidyverse package is really awesome for getting website data; I would highly recommend that everyone check out the package documentation cited here, since it can be used to scrape data from far more complex sites than a raw HTML file (a small hypothetical example follows the links below).
https://cran.r-project.org/web/packages/rvest/rvest.pdf https://www.datacamp.com/tutorial/r-web-scraping-rvest https://stackoverflow.com/questions/77790604/new-to-web-scraping-in-r-how-to-use-the-rvest-package-to-scrape-imdb-movie-dat
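As a quick, hedged illustration of that last point (the URL and CSS selector below are made up and not part of this assignment), rvest can pull just one piece of a larger page with a CSS selector before converting it to a table:

library(rvest)

# Hypothetical page and selector, shown only to illustrate the scraping pattern
page <- read_html("https://example.com/books")
page %>%
  html_elements("table.book-list") %>%   # select only the table(s) we care about
  html_table(fill = TRUE)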
https://github.com/Peter-Thompson1992/Data607/blob/main/books.html
library(rvest)

html_version <- "https://raw.githubusercontent.com/Peter-Thompson1992/Data607/main/books.html"
html_df <- read_html(html_version)
print(html_df)
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body><table>\n<tr>\n<th>Title</th>\r\n <th>Author</th>\r\n ...
books_df <- html_df %>%
  html_table(fill = TRUE)      # returns a list of every table in the document
books_df <- books_df[[1]]      # keep the first (and only) table
print(books_df)
## # A tibble: 3 × 5
## Title Author Genre `Year Published` Awards
## <chr> <chr> <chr> <int> <chr>
## 1 A Confederacy of Dunces John Kennedy Toole Fict… 1980 Pulit…
## 2 Heart of Darkness Joseph Conrad Fict… 1899 None
## 3 Good Omens Terry Pratchett, Neil G… Fant… 1990 World…
Next we will import from the JSON format: https://github.com/Peter-Thompson1992/Data607/blob/main/books.json
These sources suggested using jsonlite:
https://www.computerworld.com/article/2921176/great-r-packages-for-data-import-wrangling-visualization.html
https://www.opencpu.org/posts/jsonlite-a-smarter-json-encoder/
library(jsonlite)

json_version <- "https://raw.githubusercontent.com/Peter-Thompson1992/Data607/main/books.json"
json_df <- fromJSON(json_version)
print(json_df)
## Title Author Genre Year Published
## 1 A Confederacy of Dunces John Kennedy Toole Fiction 1980
## 2 Heart of Darkness Joseph Conrad Fiction 1899
## 3 Good Omens Terry Pratchett, Neil Gaiman Fantasy 1990
## Awards
## 1 Pulitzer Prize 1981
## 2 None
## 3 World Fantasy Award nominee for Best Novel, 1991
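For reference, jsonlite can also go the other direction. Printing the imported data frame back out with toJSON() shows roughly the array-of-objects layout that a hand-written books.json would follow (this is only an illustration; the actual file may be formatted differently):

toJSON(json_df, pretty = TRUE)   # each row becomes one JSON object inside an array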
Next we will import from the XML format: https://github.com/Peter-Thompson1992/Data607/blob/main/books.xml
https://www.reddit.com/r/Rlanguage/comments/n16b6y/cant_read_an_xml_file_with_r/
library(xml2)
library(XML)

url <- "https://raw.githubusercontent.com/Peter-Thompson1992/Data607/main/books3.xml"
xml_data <- read_xml(url)                                      # download and parse with xml2
xml_data2 <- xmlParse(as.character(xml_data), asText = TRUE)   # re-parse with the XML package
books_xml_df <- xmlToDataFrame(xml_data2)                      # one row per record node
print(books_xml_df)
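If the XML package is not available, a rough alternative is to stay entirely within xml2 and assemble the data frame by hand. This is only a sketch; it assumes the file has one child node per book whose element names match the column names used above:

library(xml2)

doc <- read_xml("https://raw.githubusercontent.com/Peter-Thompson1992/Data607/main/books3.xml")
book_nodes <- xml_children(doc)                       # assumes one element per book under the root
books_xml_df2 <- do.call(rbind, lapply(book_nodes, function(b) {
  vals <- xml_text(xml_children(b))                   # text of each child element becomes a value
  names(vals) <- xml_name(xml_children(b))            # child element names become the column names
  as.data.frame(as.list(vals), check.names = FALSE)
}))
print(books_xml_df2)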
### Conclusion
In conclusion, the data frames look essentially the same, with one slight difference in how they are read in. Because of how HTML works, the object first returned by read_html() includes the entire document, so we have to specify that we want the first table from that page. HTML certainly seems to be the most powerful format, since that is how web pages are presented; it allows for taking data from essentially anywhere on the web. A quick programmatic check of whether the data frames match is sketched below.
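This sketch assumes the XML import above produced books_xml_df. identical() is strict about column types and attributes; since Year Published comes back as an integer from the HTML table but may be character from XML, coercing every column to character and using all.equal() is usually more informative:

# Strict comparison: TRUE only if values, column types, and attributes all match
identical(as.data.frame(books_df), json_df)

# Value-only comparison: coerce every column to character first
all.equal(
  as.data.frame(lapply(books_df, as.character)),
  as.data.frame(lapply(json_df, as.character)),
  check.attributes = FALSE
)
all.equal(
  as.data.frame(lapply(books_df, as.character)),
  as.data.frame(lapply(books_xml_df, as.character)),
  check.attributes = FALSE
)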