1) Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
2) Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
3) Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
4) Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
library(rvest)
library(tidyr)
library(RCurl)
library(xml2)
library(XML)
library(jsonlite)
library(dplyr)
library(methods)
“Web Scraping”
Data on books with multiple authors was collected from “25 best books by multiple authors” on Goodreads.com.
Attributes collected: 1) name of the book; 2) authors (the page displays only one author per book, the other being hidden); 3) average rating received by the book; 4) total number of ratings; 5) Goodreads recommendation, expressed as the number of times the book was placed on this shelf.
# Download and parse the Goodreads shelf page
reads <- read_html("https://www.goodreads.com/shelf/show/multiple-authors")
# Extract each attribute by CSS selector
Book_Name <- reads %>% html_nodes(".bookTitle") %>% html_text()
Authors <- reads %>% html_nodes(".authorName span") %>% html_text()
Avg_rating <- reads %>% html_nodes("br+ .greyText.smallText") %>% html_text()
Goodread_reco <- reads %>% html_nodes("a.smallText") %>% html_text()
# Combine the vectors column-wise (all four must have the same length)
data <- data.frame(Book_Name, Authors, Avg_rating, Goodread_reco)
head(data)
## Book_Name
## 1 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch (Mass Market Paperback)
## 2 Will Grayson, Will Grayson (Hardcover)
## 3 Illuminae (The Illuminae Files, #1)
## 4 Dash & Lily's Book of Dares (Dash & Lily, #1)
## 5 Let it Snow (Kindle Edition)
## 6 Zombies Vs. Unicorns (Hardcover)
## Authors
## 1 Terry Pratchett
## 2 John Green
## 3 Amie Kaufman
## 4 Rachel Cohn
## 5 John Green
## 6 Holly Black
## Avg_rating
## 1 \n avg rating 4.26 —\n 504,411 ratings —\n published 1990\n
## 2 \n avg rating 3.84 —\n 464,083 ratings —\n published 2010\n
## 3 \n avg rating 4.31 —\n 156,786 ratings —\n published 2015\n
## 4 \n avg rating 3.82 —\n 142,368 ratings —\n published 2010\n
## 5 \n avg rating 3.78 —\n 16,610 ratings —\n published 2008\n
## 6 \n avg rating 3.75 —\n 28,273 ratings —\n published 2010\n
## Goodread_reco
## 1 (shelved 7 times as multiple-authors)
## 2 (shelved 6 times as multiple-authors)
## 3 (shelved 5 times as multiple-authors)
## 4 (shelved 5 times as multiple-authors)
## 5 (shelved 5 times as multiple-authors)
## 6 (shelved 5 times as multiple-authors)
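The scraped data needs to be converted into a tidy data frame: the Avg_rating column still holds the average rating, the number of ratings, and the publication year in a single string.
The tidied data set was then stored as a CSV file in the local working directory.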
write.csv(data1,"books.csv")
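The tidying code itself is not included in this report. The following is a minimal base R sketch of one way to split the combined strings, assuming the layout shown in the head(data) output. The helper `extract_first` and the column name `Shelved` are hypothetical, and this is not necessarily how data1 was produced.

```r
# Two sample rows modelled on the head(data) output (whitespace simplified).
raw <- data.frame(
  Avg_rating = c(
    "\n avg rating 4.26 \u2014\n 504,411 ratings \u2014\n published 1990\n",
    "\n avg rating 3.84 \u2014\n 464,083 ratings \u2014\n published 2010\n"
  ),
  Goodread_reco = c(
    "(shelved 7 times as multiple-authors)",
    "(shelved 6 times as multiple-authors)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first capture group of `pattern`
# for each element of `x`, or NA when there is no match.
extract_first <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

# One value per column: rating, rating count, year, and shelf count.
tidy <- data.frame(
  Avg_ratings   = as.numeric(extract_first(raw$Avg_rating, "avg rating ([0-9.]+)")),
  Total_ratings = as.integer(gsub(",", "", extract_first(raw$Avg_rating, "([0-9,]+) ratings"))),
  Year          = as.integer(extract_first(raw$Avg_rating, "published ([0-9]{4})")),
  Shelved       = as.integer(extract_first(raw$Goodread_reco, "shelved ([0-9]+) times"))
)

str(tidy)
```

Keeping the extraction in one small helper means a row that does not match a pattern yields NA instead of stopping the script.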
I) Code chunk to read the data frame from the .html file. I converted the whole data frame, with 50 observations, into an HTML table.
html_data <- getURL("https://raw.githubusercontent.com/solaojp/CUNY-MSDA/master/books.htm")
# Parse the downloaded HTML text, then read its table from the parsed document
parse_html <- htmlParse(html_data, asText = TRUE)
html_df <- as.data.frame(readHTMLTable(parse_html))
colnames(html_df) <- c("Sr.no","Names","Authors","Avg_ratings","Total_ratings","Year","Goodreads_reco")
head(html_df)
## Sr.no
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## Names
## 1 Good Omens: The Nice and Accurate Prophecies\n of Agnes Nutter, Witch (Mass Market Paperback)
## 2 Will Grayson, Will Grayson (Hardcover)
## 3 Illuminae (The Illuminae Files, #1)
## 4 Dash & Lily's Book of Dares (Dash &\n Lily, #1)
## 5 Let it Snow (Kindle Edition)
## 6 Zombies Vs. Unicorns (Hardcover)
## Authors Avg_ratings Total_ratings Year
## 1 Terry Pratchett+1 4.26 504263 1990
## 2 John Green+1 3.84 463967 2010
## 3 Amie Kaufman+1 4.31 156733 2015
## 4 Rachel Cohn+1 3.82 142360 2010
## 5 John Green+1 3.78 16574 2008
## 6 Holly Black+1 3.75 28270 2010
## Goodreads_reco
## 1 (shelved 7 times as multiple-authors)
## 2 (shelved 6 times as multiple-authors)
## 3 (shelved 5 times as multiple-authors)
## 4 (shelved 5 times as multiple-authors)
## 5 (shelved 5 times as multiple-authors)
## 6 (shelved 5 times as multiple-authors)
str(html_df)
## 'data.frame': 50 obs. of 7 variables:
## $ Sr.no : Factor w/ 50 levels "1","10","11",..: 1 12 23 34 45 47 48 49 50 2 ...
## $ Names : Factor w/ 50 levels "A Heartwarming Holiday (Kindle Edition)",..: 9 48 13 6 15 50 5 2 22 24 ...
## $ Authors : Factor w/ 45 levels "A.C. Bextor+1",..: 45 26 3 37 26 21 16 12 29 7 ...
## $ Avg_ratings : Factor w/ 43 levels "3.39","3.47",..: 34 15 36 14 11 9 4 29 17 15 ...
## $ Total_ratings : Factor w/ 50 levels "1154","1183",..: 31 28 6 5 7 17 25 32 40 13 ...
## $ Year : Factor w/ 14 levels "","1990","1997",..: 2 8 12 8 6 8 5 13 14 12 ...
## $ Goodreads_reco: Factor w/ 6 levels "(shelved 2 times as multiple-authors)",..: 6 5 4 4 4 4 4 3 2 2 ...
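The str() output shows that every column, including the numeric ones, was read as a factor. Here is a small sketch of converting such columns to numbers, using a stand-in data frame (`html_like`, a hypothetical name) built from values in the output above. Calling as.numeric() directly on a factor returns the internal level codes, so the conversion goes through as.character() first.

```r
# Stand-in for the numeric-looking columns of html_df (values from the output above).
html_like <- data.frame(
  Avg_ratings   = factor(c("4.26", "3.84", "4.31")),
  Total_ratings = factor(c("504263", "463967", "156733")),
  Year          = factor(c("1990", "2010", "2015"))
)

# Pitfall: this yields the level codes, not the ratings themselves.
codes <- as.numeric(html_like$Avg_ratings)

# Convert via character so the printed values are what gets parsed.
html_like$Avg_ratings   <- as.numeric(as.character(html_like$Avg_ratings))
html_like$Total_ratings <- as.integer(as.character(html_like$Total_ratings))
html_like$Year          <- as.integer(as.character(html_like$Year))

str(html_like)
```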
II) XML conversion
(As the file was created manually with TextWrangler and then pushed to the GitHub repository, only the first three entries are included. The same code is applicable to all 50 observations.)
xml_data <- getURL("https://raw.githubusercontent.com/solaojp/CUNY-MSDA/master/libbooks.xml")
xml_data <- xmlParse(xml_data)
xml_data_df <- xmlToDataFrame(xml_data)
xml_data_df
## name
## 1 Good Omens The Nice and Accurate Prophecies of Agnes Nutter
## 2 Will Grayson
## 3 Illuminae The Illuminae Files
## author rating trating year
## 1 Terry Pratchett+1 4.26 504263 1990
## 2 John Green 3.84 463967 2010
## 3 Amie Kaufman+1 4.31 156733 2015
## reco
## 1 shelved 7 times as multiple authors
## 2 shelved 6 times as multiple authors
## 3 shelved 5 times as multiple authors
str(xml_data_df)
## 'data.frame': 3 obs. of 6 variables:
## $ name : Factor w/ 3 levels " Illuminae The Illuminae Files ",..: 2 3 1
## $ author : Factor w/ 3 levels " Amie Kaufman+1 ",..: 3 2 1
## $ rating : Factor w/ 3 levels "3.84 ","4.26 ",..: 2 1 3
## $ trating: Factor w/ 3 levels " 504263 ","156733 ",..: 1 3 2
## $ year : Factor w/ 3 levels "1990 ","2010",..: 1 2 3
## $ reco : Factor w/ 3 levels " shelved 7 times as multiple authors ",..: 1 3 2
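The factor levels above carry leading and trailing spaces from the hand-written file. The sketch below parses a small inline document and cleans the result. The element names `<library>` and `<book>` are assumptions (only the field names appear in the output), and the inline text is a stand-in for libbooks.xml, not its actual contents.

```r
library(XML)

# Hypothetical stand-in for libbooks.xml; <library> and <book> are assumed names.
xml_text <- "
<library>
  <book>
    <name> Will Grayson </name>
    <author> John Green </author>
    <rating>3.84 </rating>
    <trating> 463967 </trating>
    <year>2010</year>
    <reco> shelved 6 times as multiple authors </reco>
  </book>
  <book>
    <name> Illuminae The Illuminae Files </name>
    <author> Amie Kaufman+1 </author>
    <rating>4.31 </rating>
    <trating>156733 </trating>
    <year>2015 </year>
    <reco> shelved 5 times as multiple authors </reco>
  </book>
</library>"

# Parse the string as XML and flatten one <book> per row, keeping strings as character.
doc   <- xmlParse(xml_text, asText = TRUE)
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Strip the stray whitespace, then let R pick numeric types where possible.
books[] <- lapply(books, trimws)
books[] <- lapply(books, type.convert, as.is = TRUE)

str(books)
```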
III) JSON conversion.
While creating the .json file manually from the Goodreads data set, the whole data set was placed inside a single JSON object, so only one row is generated when the file is imported from GitHub. The process of converting the data set to JSON remains the same for all entries.
json_data <- getURL("https://raw.githubusercontent.com/solaojp/CUNY-MSDA/master/libbooks.json")
parse_json <- fromJSON(json_data)
json_data_df <- as.data.frame(parse_json)
str(json_data_df)
## 'data.frame': 1 obs. of 18 variables:
## $ library.book_1.name : Factor w/ 1 level "Good Omens The Nice and Accurate Prophecies of Agnes Nutter": 1
## $ library.book_1.author : Factor w/ 1 level "Terry Pratchett+1 ": 1
## $ library.book_1.rating : Factor w/ 1 level "4.26": 1
## $ library.book_1.trating: Factor w/ 1 level "504263": 1
## $ library.book_1.year : Factor w/ 1 level "1990": 1
## $ library.book_1.reco : Factor w/ 1 level "shelved 7 times as multiple authors": 1
## $ library.book_2.name : Factor w/ 1 level "Will Grayson": 1
## $ library.book_2.author : Factor w/ 1 level "John Green": 1
## $ library.book_2.rating : Factor w/ 1 level "3.84": 1
## $ library.book_2.trating: Factor w/ 1 level "463967": 1
## $ library.book_2.year : Factor w/ 1 level "2010": 1
## $ library.book_2.reco : Factor w/ 1 level "shelved 6 times as multiple authors": 1
## $ library.book_3.name : Factor w/ 1 level "Illuminae The Illuminae Files": 1
## $ library.book_3.author : Factor w/ 1 level "Amie Kaufman+1": 1
## $ library.book_3.rating : Factor w/ 1 level "4.31": 1
## $ library.book_3.trating: Factor w/ 1 level "156733": 1
## $ library.book_3.year : Factor w/ 1 level "2015 ": 1
## $ library.book_3.reco : Factor w/ 1 level "shelved 5 times as multiple authors": 1
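The column names above (library.book_1.name and so on) suggest the file nests each book as a named object under "library", which is an assumption about libbooks.json rather than something shown in the report. Under that assumption, the sketch below shows two ways to get one row per book: reshaping the nested object after parsing, or writing the file as a JSON array of objects. Only three fields are included to keep the example short.

```r
library(jsonlite)

# Assumed shape of the hand-written file: one named object per book.
nested_json <- '{"library": {
  "book_1": {"name": "Will Grayson", "rating": "3.84", "year": "2010"},
  "book_2": {"name": "Illuminae The Illuminae Files", "rating": "4.31", "year": "2015"}
}}'

parsed <- fromJSON(nested_json)

# Option 1: turn each book into a one-row data frame and stack the rows.
books_nested <- do.call(
  rbind,
  lapply(parsed$library, function(b) as.data.frame(b, stringsAsFactors = FALSE))
)

# Option 2: store the books as an array of objects; fromJSON() then
# simplifies the array directly into a data frame with one row per object.
array_json <- '[
  {"name": "Will Grayson", "rating": "3.84", "year": "2010"},
  {"name": "Illuminae The Illuminae Files", "rating": "4.31", "year": "2015"}
]'
books_array <- fromJSON(array_json)

str(books_nested)
str(books_array)
```

The array layout needs no reshaping code, which may make it the simpler choice when the file is written by hand.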
Conclusion
I) Display format: All three sources load into R as tabular data frames, so the table format applies to all three. As loaded here, their dimensions differ (50 x 7 for HTML, 3 x 6 for XML, and 1 x 18 for JSON).
II) Class of variables: All three data frames have variables of class ‘factor’. This may differ from data set to data set, but in the data frames above the variables are of the same class.
Finally, having worked with all three formats (‘.html’, ‘.xml’ and ‘.json’), I conclude that the R code for loading each format may differ and rely on different packages, but the data frames can be made identical and would support the same analysis.
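Whether two loaded data frames are identical can be checked directly with identical(). The sketch below uses two small stand-in data frames (`from_html` and `from_xml`, hypothetical names) that differ only in column class and stray whitespace, as the HTML and XML results above do. The helper `normalize` is likewise hypothetical.

```r
# Stand-ins: same content, but factors in one and padded strings in the other.
from_html <- data.frame(
  name   = factor(c("Will Grayson", "Illuminae The Illuminae Files")),
  rating = factor(c("3.84", "4.31")),
  year   = factor(c("2010", "2015"))
)
from_xml <- data.frame(
  name   = c(" Will Grayson ", " Illuminae The Illuminae Files "),
  rating = c("3.84 ", "4.31 "),
  year   = c("2010", "2015 "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace and re-infer column types,
# so that only the actual values are compared.
normalize <- function(df) {
  df[] <- lapply(df, function(col) {
    type.convert(trimws(as.character(col)), as.is = TRUE)
  })
  rownames(df) <- NULL
  df
}

same_raw   <- identical(from_html, from_xml)
same_clean <- identical(normalize(from_html), normalize(from_xml))

print(c(raw = same_raw, cleaned = same_clean))
```

In this example the raw comparison fails and the cleaned one succeeds, which illustrates the conclusion above: the formats load differently, but the resulting data frames can be made identical.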