Data 607 - Week 7 Assignment - Working with HTML, XML, and JSON in R

Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Week_7

Github: https://github.com/littlejohnjeff/DATA607_Fall2018/blob/master/Data%20607%20-%20Week%207%20Assignment%20-%20Working%20with%20HTML%2C%20XML%20and%20JSON%20in%20R%20-%20Jeff%20Littlejohn.Rmd

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author.

For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames.

Loading requisite libraries.

library(XML)
library(RCurl)
library(rlist)
library(magrittr)
library(rvest)
library(tidyverse)
library(jsonlite)
library(RJSONIO)

HTML

Read the raw HTML living on GitHub into a data frame.

#https://www.rdocumentation.org/packages/textreadr/versions/0.9.0/topics/read_html
html_read_1 <- getURL ("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/books.html") %>% 
  read_html() %>% 
  html_nodes( xpath="//table")

# html_read_1 is an xml_nodeset holding the matching <table> nodes, not a character vector
#html_read_1

#https://www.rdocumentation.org/packages/rvest/versions/0.3.2/topics/html_table
#convert html into dataframe
df_books <- html_table(html_read_1)[[1]]
df_books

##                                                                    Title
## 1        The Art of R Programming: A Tour of Statistical Software Design
## 2                        R for Everyone: Advanced Analytics and Graphics
## 3 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
##           Authors           Authors Edition Pages Publication Year
## 1  Norman Matloff                         1   400             2011
## 2 Jared P. Lander                         2   560             2017
## 3  Hadley Wickham Garrett Grolemund       1   522             2017

Looks solid. Data types were detected automatically.
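
Because read_html() accepts a URL, a file path, or a literal HTML string, the table-to-data-frame step can also be tried without RCurl. The snippet below is a minimal sketch on an inline table; the markup and the names `html_sample` and `df_sample` are illustrative assumptions, not the contents of books.html.

```r
library(rvest)

# Hypothetical stand-in for books.html: a small table with a header row
html_sample <- '<html><body><table>
  <tr><th>Title</th><th>Edition</th><th>Pages</th></tr>
  <tr><td>The Art of R Programming</td><td>1</td><td>400</td></tr>
  <tr><td>R for Everyone</td><td>2</td><td>560</td></tr>
</table></body></html>'

# read_html() parses the string; html_table() returns one data frame per <table>
tables <- html_table(read_html(html_sample))
df_sample <- tables[[1]]

# Inspect the column types that html_table() inferred
str(df_sample)
```

Calling html_table() on the whole document returns a list, so the first table still has to be pulled out with `[[1]]`, just as in the code above.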

XML

Read XML into a data frame.

xml_getURL <- getURL ("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/books.xml") 
xml_parsed <- xmlParse(xml_getURL)
xml_root <- xmlRoot(xml_parsed)

df_xml <- xmlToDataFrame(xml_root, stringsAsFactors = FALSE)
df_xml

##                                                                    title
## 1        The Art of R Programming: A Tour of Statistical Software Design
## 2                        R for Everyone: Advanced Analytics and Graphics
## 3 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
##           author1           author2 edition pages publication_year
## 1  Norman Matloff                         1   400             2011
## 2 Jared P. Lander                         2   560             2017
## 3  Hadley Wickham Garrett Grolemund       1   522             2017
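
xmlToDataFrame() reads every element's text as a string, so the numeric-looking columns arrive as character. A minimal sketch of converting them afterwards, using an inline document whose element names mirror the columns above (the sample XML and the names `xml_sample`, `df_xml_sample`, and `num_cols` are assumptions for illustration, not the actual books.xml):

```r
library(XML)

# Hypothetical stand-in for books.xml; element names mirror the columns shown above
xml_sample <- '<r_books>
  <book><title>The Art of R Programming</title><edition>1</edition><pages>400</pages><publication_year>2011</publication_year></book>
  <book><title>R for Everyone</title><edition>2</edition><pages>560</pages><publication_year>2017</publication_year></book>
</r_books>'

xml_doc <- xmlParse(xml_sample, asText = TRUE)
df_xml_sample <- xmlToDataFrame(xml_doc, stringsAsFactors = FALSE)

# Every column starts out as character
sapply(df_xml_sample, class)

# Convert the numeric-looking columns explicitly
num_cols <- c("edition", "pages", "publication_year")
df_xml_sample[num_cols] <- lapply(df_xml_sample[num_cols], as.integer)

# Check the classes again after conversion
sapply(df_xml_sample, class)
```

Naming the columns explicitly keeps the conversion predictable; text columns such as the title are left untouched.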

JSON

Read the JSON file from GitHub and convert it to a data frame.

json_getURL <- getURL ("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/books.json")
json_from <- fromJSON(json_getURL)
# Struggled to get this to read into separate rows, so the single wide row is split into three 6-column subsets before binding
json_wide <- as.data.frame(json_from, stringsAsFactors = FALSE)
df_json_1 <- json_wide[1, 1:6]
df_json_2 <- json_wide[1, 7:12]
df_json_3 <- json_wide[1, 13:18]
#rename columns to ensure consistency to enable rbind
names(df_json_2) <- names(df_json_1)
names(df_json_3) <- names(df_json_1)
df_json <- rbind(df_json_1,df_json_2,df_json_3)
df_json

##                                                            r_books.title
## 1        The Art of R Programming: A Tour of Statistical Software Design
## 2                        R for Everyone: Advanced Analytics and Graphics
## 3 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
##    r_books.author1   r_books.author2 r_books.edition r_books.pages
## 1   Norman Matloff                                 1           400
## 2 Jared P. Lander                                  2           560
## 3   Hadley Wickham Garrett Grolemund               1           522
##   r_books.publication_year
## 1                     2011
## 2                     2017
## 3                     2017
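
Both jsonlite and RJSONIO export a function named fromJSON(), and because RJSONIO is attached last in the library list, its version is the one an unqualified fromJSON() call uses. That may be why the result needed reshaping by hand. As a sketch, assuming books.json is an object whose r_books key holds an array of book records (a guess based on the r_books. column prefix, not a confirmed layout), calling jsonlite explicitly simplifies the array straight to a data frame:

```r
# Hypothetical stand-in for books.json; the real file's layout is assumed, not confirmed
json_sample <- '{"r_books": [
  {"title": "The Art of R Programming", "author1": "Norman Matloff", "author2": "", "pages": 400},
  {"title": "R for Everyone", "author1": "Jared P. Lander", "author2": "", "pages": 560},
  {"title": "R for Data Science", "author1": "Hadley Wickham", "author2": "Garrett Grolemund", "pages": 522}
]}'

# The jsonlite:: prefix picks jsonlite's parser regardless of attach order
parsed <- jsonlite::fromJSON(json_sample)

# An array of records with matching keys is simplified to one row per record
df_json_sample <- parsed$r_books
df_json_sample
```

If the real file is shaped differently, the `$r_books` step would need to change, but the namespace prefix still removes the ambiguity between the two packages.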

Our data frames look very similar. There’s some slight inconsistency in column names due to character flaws on the part of the creator. HTML and JSON parsing maintained the data types (numbers did not become strings) on their way to becoming data frames, but the XML values all came through as strings. That would be a quick fix that probably could have been handled when the data frame was created.
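
One way to check how close two of the results really are is to put them on common column names and types and then compare them. The sketch below uses two small made-up data frames (`df_a` and `df_b` are hypothetical stand-ins, not the frames built above):

```r
# Hypothetical two-row versions of an HTML-derived and an XML-derived data frame
df_a <- data.frame(Title = c("Book A", "Book B"), Pages = c(400L, 560L),
                   stringsAsFactors = FALSE)
df_b <- data.frame(title = c("Book A", "Book B"), pages = c("400", "560"),
                   stringsAsFactors = FALSE)

# Put both on the same lower-case column names
names(df_a) <- tolower(names(df_a))

# Align the column type that came through as character
df_b$pages <- as.integer(df_b$pages)

# all.equal() returns TRUE when the frames match, otherwise a description of the differences
comparison <- all.equal(df_a, df_b)
comparison
```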

Jeff Littlejohn

October 13, 2018