Task:

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Libraries:

library(RJSONIO)
library(knitr)
library(RCurl)
library(tidyverse)
library(XML)

Reading HTML data into R from GitHub

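# Download the raw HTML file from GitHub as a single string
html_url <- getURL("https://raw.githubusercontent.com/mandiemannz/Data-607--Fall-18/master/Bookshtml.html")

# Parse the HTML table(s) in the page and combine the result into a data frame
html_data <- html_url %>%
  readHTMLTable() %>%
  data.frame()

head(html_data)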

##                                                                   NULL.Title
## 1 Python Crash Course: A Hands-On, Project-Based Introduction to Programming
## 2     R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
## 3                                                    Machine Learning with R
##                            NULL.Author NULL.Cover.Type NULL.Subject
## 1                         Eric Matthes       Paperback  Programming
## 2 Hadley Wickham and Garrett Grolemund       Paperback  Programming
## 3                          Brett Lantz       Paperback  Programming
##   NULL.Pages
## 1        525
## 2        492
## 3        424

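# Strip the "NULL." prefix, then turn the remaining dot into a space
# (str_replace only changes the first match in each column name)
colnames(html_data) <- str_replace(colnames(html_data), "NULL\\.", "")
colnames(html_data) <- str_replace(colnames(html_data), "\\.", " ")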

kable(html_data)

Title	Author	Cover Type	Subject	Pages
Python Crash Course: A Hands-On, Project-Based Introduction to Programming	Eric Matthes	Paperback	Programming	525
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	Hadley Wickham and Garrett Grolemund	Paperback	Programming	492
Machine Learning with R	Brett Lantz	Paperback	Programming	424
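
The `NULL.` prefix seen in the first output most likely arises because `readHTMLTable()` returns a list of tables (an unnamed table is labelled `NULL`), and `data.frame()` then prefixes each column with that list name. A minimal sketch of one way to avoid the prefix is to ask for a single table with `which = 1`. The inline HTML below is a hypothetical, shortened stand-in for the real file (abbreviated titles, three columns only) so that the example runs without a network connection.

```r
library(XML)

# Hypothetical inline copy of a books table (assumption: header cells use <th>)
html_text <- '<html><body>
<table>
  <tr><th>Title</th><th>Author</th><th>Pages</th></tr>
  <tr><td>Python Crash Course</td><td>Eric Matthes</td><td>525</td></tr>
  <tr><td>R for Data Science</td><td>Hadley Wickham and Garrett Grolemund</td><td>492</td></tr>
  <tr><td>Machine Learning with R</td><td>Brett Lantz</td><td>424</td></tr>
</table>
</body></html>'

# Parse the string as HTML content rather than as a file name
doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables
html_books <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

# Every cell is read as text, so convert the numeric column explicitly
html_books$Pages <- as.integer(html_books$Pages)

str(html_books)
```

Because the result is already a single data frame, no `data.frame()` call is needed, and the column names should not pick up a list-name prefix.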

Reading XML data into R from GitHub

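# Download the raw XML file from GitHub as a single string
xml_url <- getURL("https://raw.githubusercontent.com/mandiemannz/Data-607--Fall-18/master/booksxml.xml")

# Parse the XML and convert each child of the root node into one row
xml_data <- xml_url %>%
  xmlParse() %>%
  xmlToDataFrame()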

kable(xml_data)

title	author	pages	category	cover_type
Python Crash Course: A Hands-On, Project-Based Introduction to Programming	Eric Matthes	525	Programming	paperback
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	Hadley Wickham and Garrett Grolemund	492	Programming	paperback
Machine Learning with R	Brett Lantz	424	Programming	paperback
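
As a rough illustration of what `xmlToDataFrame()` does, the sketch below parses a small inline XML string. The element names (`books`, `book`, `title`, `author`, `pages`) are assumptions modelled on the column names in the table above, not a copy of the actual file. Each child of the root becomes a row, each grandchild becomes a column, and every value arrives as text, so numeric fields need an explicit conversion.

```r
library(XML)

# Hypothetical inline XML; the structure of the real file is assumed, not known
xml_text <- '<books>
  <book>
    <title>Python Crash Course</title>
    <author>Eric Matthes</author>
    <pages>525</pages>
  </book>
  <book>
    <title>R for Data Science</title>
    <author>Hadley Wickham and Garrett Grolemund</author>
    <pages>492</pages>
  </book>
  <book>
    <title>Machine Learning with R</title>
    <author>Brett Lantz</author>
    <pages>424</pages>
  </book>
</books>'

# Parse the string as XML content rather than as a file name
doc <- xmlParse(xml_text, asText = TRUE)

# One row per <book>, one column per child element
xml_books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# All columns are character after parsing; convert the page counts to integers
xml_books$pages <- as.integer(xml_books$pages)

str(xml_books)
```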

Reading JSON data into R from GitHub

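# Download the raw JSON file from GitHub as a single string
json_data <- getURLContent("https://raw.githubusercontent.com/mandiemannz/Data-607--Fall-18/master/json")

# Parse the JSON into a nested R list
json_data_frame <- fromJSON(json_data)

# Build a data frame from each entry under "books" and bind the pieces together
json_data_frame <- do.call("rbind", lapply(json_data_frame$books, data.frame, stringsAsFactors = FALSE))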


kable(json_data_frame)

	c.title….Python.Crash.Course..A.Hands.On..Project.Based.Introduction.to.Programming…	c.title….R.for.Data.Science..Import..Tidy..Transform..Visualize..and.Model.Data…	c.title….Machine.Learning.with.R…author….Brett.Lantz…
book.title	Python Crash Course: A Hands-On, Project-Based Introduction to Programming	R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	Machine Learning with R
book.author	Eric Matthes	Hadley Wickham and Garrett Grolemund	Brett Lantz
book.pages	525	492	424
book.category	Programming	Programming	Programming
book.cover_type	paperback	paperback	paperback
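
The table above has one column per book rather than one row per book. A possible way to get the row-per-book layout is sketched below. It assumes a flat JSON structure in which `books` is an array of objects with `title`, `author`, and `pages` fields; the real file may nest its fields differently (the `book.` prefix in the row names suggests an extra level), so treat the field names as hypothetical.

```r
library(RJSONIO)

# Hypothetical inline JSON; the layout of the real file is an assumption
json_text <- '{"books": [
  {"title": "Python Crash Course", "author": "Eric Matthes", "pages": 525},
  {"title": "R for Data Science", "author": "Hadley Wickham and Garrett Grolemund", "pages": 492},
  {"title": "Machine Learning with R", "author": "Brett Lantz", "pages": 424}
]}'

# simplify = FALSE keeps every JSON object as a plain named list
parsed <- RJSONIO::fromJSON(json_text, asText = TRUE, simplify = FALSE)

# Turn each book into a one-row data frame, then stack the rows
book_rows <- lapply(parsed$books, function(b) {
  as.data.frame(b, stringsAsFactors = FALSE)
})
json_books <- do.call(rbind, book_rows)

str(json_books)
```

Keeping each book as a named list before the conversion means the field names become column names, which matches the layout produced from the HTML and XML sources.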

Conclusion:

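The HTML and XML data frames hold the same information in the same layout, three rows (one per book) by five columns, but they are not strictly identical: the column names and column order differ, and the cover type is capitalized in the HTML version (“Paperback”) but lowercase in the XML version (“paperback”). The JSON data frame is structured differently: it is transposed, with each book in its own column (three columns) and each attribute in its own row.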
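
To make the comparison concrete, here is a small sketch of how the question “are the data frames identical?” could be checked in code. The two data frames are hand-built, one-book stand-ins for the HTML and XML results (an assumption made so the example runs without downloading anything), and the harmonization steps are just one reasonable choice.

```r
# Hand-built stand-ins for the HTML and XML results (one book only)
html_df <- data.frame(
  Title = "Machine Learning with R",
  Author = "Brett Lantz",
  `Cover Type` = "Paperback",
  Subject = "Programming",
  Pages = "424",
  check.names = FALSE, stringsAsFactors = FALSE
)
xml_df <- data.frame(
  title = "Machine Learning with R",
  author = "Brett Lantz",
  pages = "424",
  category = "Programming",
  cover_type = "paperback",
  stringsAsFactors = FALSE
)

# Strict comparison: fails because names, column order, and case differ
same_before <- identical(html_df, xml_df)

# Harmonize the HTML version: same names, same case, same column order
names(html_df) <- c("title", "author", "cover_type", "category", "pages")
html_df$cover_type <- tolower(html_df$cover_type)
html_df <- html_df[names(xml_df)]

# Compare again after harmonizing
same_after <- isTRUE(all.equal(html_df, xml_df))

c(before = same_before, after = same_after)
```

`identical()` is deliberately strict, so it flags differences in names, order, and types as well as values; `all.equal()` is the more forgiving check once the two frames have been brought into the same shape.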

Working with XML and JSON in R

Amanda Arce

October 11, 2018
