Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Week_7
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author.
For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames.
Loading requisite libraries.
library(XML)
library(RCurl)
library(rlist)
library(magrittr)
library(rvest)
library(tidyverse)
library(jsonlite)
library(RJSONIO)
HTML
Read the raw HTML file hosted on GitHub into a data frame.
#https://www.rdocumentation.org/packages/textreadr/versions/0.9.0/topics/read_html
# Download the page, parse it, and select every <table> node
html_read_1 <- getURL("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/books.html") %>%
  read_html() %>%
  html_nodes(xpath = "//table")
# html_read_1 is an xml_nodeset of <table> nodes, not a character vector
#html_read_1
#https://www.rdocumentation.org/packages/rvest/versions/0.3.2/topics/html_table
# html_table() returns a list of data frames; keep the first table
df_books <- html_table(html_read_1)[[1]]
df_books
##                                                                     Title
## 1         The Art of R Programming: A Tour of Statistical Software Design
## 2                         R for Everyone: Advanced Analytics and Graphics
## 3 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
##           Authors           Authors Edition Pages Publication Year
## 1  Norman Matloff                         1   400             2011
## 2 Jared P. Lander                         2   560             2017
## 3  Hadley Wickham Garrett Grolemund       1   522             2017
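The table looks right, and the data types were detected automatically: Edition, Pages, and Publication Year came through as numbers rather than strings.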
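The printed table has two columns that are both named Authors, which makes it awkward to select either one by name. One possible cleanup is sketched below in base R. The data frame `df_demo` and its single row are a hypothetical stand-in for the scraped table, not the real `df_books` object.

```r
# Hypothetical stand-in for the scraped table: two columns share the name "Authors".
df_demo <- data.frame(
  "R for Data Science",
  "Hadley Wickham",
  "Garrett Grolemund",
  stringsAsFactors = FALSE
)
names(df_demo) <- c("Title", "Authors", "Authors")

# make.unique() leaves the first occurrence alone and appends .1, .2, ... to repeats.
names(df_demo) <- make.unique(names(df_demo))
names(df_demo)
```

After this step the second author column can be addressed as `df_demo$Authors.1`, or renamed to something more descriptive.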
XML
Read the XML file into a data frame.
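# Download the XML file as a single string
xml_getURL <- getURL("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/books.xml")
# Parse the string and grab the root node that holds the book records
xml_parsed <- xmlParse(xml_getURL)
xml_root <- xmlRoot(xml_parsed)
# xmlToDataFrame() reads every field as text, so all columns come back as character
df_xml <- xmlToDataFrame(xml_root, stringsAsFactors = FALSE)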
df_xml
##                                                                     title
## 1         The Art of R Programming: A Tour of Statistical Software Design
## 2                         R for Everyone: Advanced Analytics and Graphics
## 3 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
##           author1           author2 edition pages publication_year
## 1  Norman Matloff                         1   400             2011
## 2 Jared P. Lander                         2   560             2017
## 3  Hadley Wickham Garrett Grolemund       1   522             2017
JSON
Read the JSON file from GitHub and convert it to a data frame.
json_getURL <- getURL("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/books.json")
# jsonlite and RJSONIO both export fromJSON(); RJSONIO was attached last, so its
# version is the one that runs. Naming the package makes that explicit.
json_from <- RJSONIO::fromJSON(json_getURL)
# as.data.frame() flattens the parsed list into one row of 18 columns
# (6 fields x 3 books), so take one 6-column slice per book before binding
df_json_wide <- as.data.frame(json_from, stringsAsFactors = FALSE)
df_json_1 <- df_json_wide[1, 1:6]
df_json_2 <- df_json_wide[1, 7:12]
df_json_3 <- df_json_wide[1, 13:18]
# rename columns so the three slices match, which rbind requires
names(df_json_2) <- names(df_json_1)
names(df_json_3) <- names(df_json_1)
df_json <- rbind(df_json_1, df_json_2, df_json_3)
df_json
##                                                             r_books.title
## 1         The Art of R Programming: A Tour of Statistical Software Design
## 2                         R for Everyone: Advanced Analytics and Graphics
## 3 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
##   r_books.author1   r_books.author2 r_books.edition r_books.pages
## 1  Norman Matloff                                 1           400
## 2 Jared P. Lander                                 2           560
## 3  Hadley Wickham Garrett Grolemund               1           522
##   r_books.publication_year
## 1                     2011
## 2                     2017
## 3                     2017
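The slice-and-rbind step above hard-codes the column ranges for three books. A more general pattern is to convert each record to a one-row data frame and stack the rows. The sketch below assumes the parsed JSON is a list holding one named list per book. The `books` object, its field values, and the name `df_books_json` are hypothetical stand-ins, not the actual contents of books.json.

```r
# Hypothetical stand-in for the parsed JSON: one named list per book.
books <- list(
  r_books = list(
    list(title = "Book A", author1 = "Author One", author2 = "", pages = 400),
    list(title = "Book B", author1 = "Author Two", author2 = "Author Three", pages = 522)
  )
)

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(books$r_books, as.data.frame, stringsAsFactors = FALSE)
df_books_json <- do.call(rbind, rows)
df_books_json
```

This version does not depend on the number of books or on column positions, although it still requires every record to carry the same field names.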
The three data frames look very similar. The column names differ slightly because I named the fields inconsistently when creating the source files. HTML and JSON parsing preserved the data types (numbers did not become strings) on the way to becoming data frames, but the XML parse returned every column as a string. That would be a quick fix, and it probably could have been handled when the data frame was created.
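One way to apply the quick fix for the all-string XML columns is base R's `type.convert()`, applied column by column. The sketch below is one option among several. The data frame `df_xml_demo` and its values are a hypothetical stand-in for `df_xml`.

```r
# Hypothetical stand-in for the XML result: every column is character.
df_xml_demo <- data.frame(
  title = c("Book A", "Book B"),
  edition = c("1", "2"),
  pages = c("400", "560"),
  stringsAsFactors = FALSE
)

# Convert each column to the simplest type that fits its values;
# as.is = TRUE keeps text columns as character instead of factor.
df_xml_typed <- df_xml_demo
df_xml_typed[] <- lapply(df_xml_typed, type.convert, as.is = TRUE)
str(df_xml_typed)
```

Columns that contain only whole numbers become integer, while genuinely textual columns such as the title are left as character.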