Setup

library(knitr)
library(kableExtra)
library(prettydoc)
library(rvest)
library(dplyr)
library(jsonlite)
library(RCurl)
library(XML)
remove(list=ls())

HTML

First, store the raw HTML file as an R object using the read_html function.

books.html <- read_html("https://raw.githubusercontent.com/aliceafriedman/DATA607_HW_WK7/master/books.html")

Then, extract the table node from that object with the html_nodes function from the rvest library, and convert it to a data frame with html_table.

books.html.table <- html_nodes(books.html, "table")
books.html.table <- html_table(books.html.table[1], fill = TRUE) %>% as.data.frame() %>% glimpse()
## Observations: 4
## Variables: 7
## $ Title             <chr> "How to be a Straight-A Student", "Deep Work...
## $ Auth1.Last        <chr> "Newport", "Newport", "Paul", "Dallas"
## $ Auth1.First       <chr> "Cal", "Cal", "Marilyn", "Diamond Dallas"
## $ Auth2.Last        <chr> "", "", "", "Aaron"
## $ Auth2.First       <chr> "", "", "", "Craig"
## $ Year              <int> 2006, 2016, 2003, 2005
## $ Short.Description <chr> "A guide for getting better grades in colleg...
class(books.html.table)
## [1] "data.frame"
books.html.table %>% 
  kable() %>% kable_styling(bootstrap_options = "responsive")
Title | Auth1.Last | Auth1.First | Auth2.Last | Auth2.First | Year | Short.Description
--- | --- | --- | --- | --- | --- | ---
How to be a Straight-A Student | Newport | Cal | | | 2006 | A guide for getting better grades in college, most of which still applies for grad school, and some of which is useful in professional settings
Deep Work | Newport | Cal | | | 2016 | Treatise on the importance of cariving out time for deep concerntaiton, with many helpful, practical tips for improving your productivity at work
It’s Hard to Make a Difference When You Can’t Find Your Keys | Paul | Marilyn | | | 2003 | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills.
Yoga for Regular Guys | Dallas | Diamond Dallas | Aaron | Craig | 2005 | Easy to follow workouts for folks who aren’t too flexible.
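
The same read_html, html_nodes, html_table sequence can be tried without a network connection. The sketch below is only an illustration: the inline HTML string and the demo_* names are hypothetical stand-ins for books.html, and it assumes read_html accepts literal HTML text, which current rvest versions do.

```r
library(rvest)

# Hypothetical two-row table standing in for the real books.html file
demo_html <- "<html><body><table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Deep Work</td><td>2016</td></tr>
  <tr><td>Yoga for Regular Guys</td><td>2005</td></tr>
</table></body></html>"

# Parse the literal HTML, select every <table> node, convert the first one
demo_page <- read_html(demo_html)
demo_tables <- html_nodes(demo_page, "table")
demo_df <- as.data.frame(html_table(demo_tables[[1]]))

# Inspect the column types that html_table inferred
str(demo_df)
```

Because html_table converts column types as it reads, Year should come back as a number rather than text, which matches the glimpse output shown for the real file.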

JSON

First, read the JSON objects into R using the jsonlite package. Setting the simplifyVector option to TRUE makes read_json return a data frame automatically instead of a nested list.

Unlike the HTML import, this also produces a row number for each record.

books.json <- read_json("https://raw.githubusercontent.com/aliceafriedman/DATA607_HW_WK7/master/books.json",  simplifyVector = TRUE)
class(books.json)
## [1] "data.frame"
books.json %>% glimpse()
## Observations: 4
## Variables: 7
## $ Title             <chr> "How to Be a Straight-A Student", "Deep Work...
## $ Auth1.Last        <chr> "Newport", "Newport", "Paul", "Page"
## $ Auth1.First       <chr> "Cal", "Cal", "Marilyn", "Diamond Dallas"
## $ Auth2.Last        <chr> "", "", "", "Aaron"
## $ Auth2.First       <chr> "", "", "", "Craig"
## $ Year              <int> 2006, 2016, 2003, 2005
## $ Short.Description <chr> "A guide for getting better grades in colleg...
books.json %>%  kable() %>% kable_styling(bootstrap_options = "responsive")
Title | Auth1.Last | Auth1.First | Auth2.Last | Auth2.First | Year | Short.Description
--- | --- | --- | --- | --- | --- | ---
How to Be a Straight-A Student | Newport | Cal | | | 2006 | A guide for getting better grades in college, most of which still applies for grad school, and some of which is useful in professional settings
Deep Work | Newport | Cal | | | 2016 | Treatise on the importance of cariving out time for deep concerntaiton, with many helpful, practical tips for improving your productivity at work
It’s Hard to Make a Difference When You Can’t Find Your Keys | Paul | Marilyn | | | 2003 | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills
Yoga for Regular Guys | Page | Diamond Dallas | Aaron | Craig | 2005 | Easy to follow workouts for folks who aren’t too flexible
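
To see what simplifyVector changes, the following sketch parses the same JSON text twice. It uses parse_json, the jsonlite function that works on a JSON string rather than a file path, so it runs offline; the demo_json text and the as_list / as_df names are made up for illustration.

```r
library(jsonlite)

# Hypothetical JSON array standing in for the real books.json file
demo_json <- '[{"Title": "Deep Work", "Year": 2016},
               {"Title": "Yoga for Regular Guys", "Year": 2005}]'

# Default (simplifyVector = FALSE): a list with one element per record
as_list <- parse_json(demo_json)

# simplifyVector = TRUE: records are collapsed into a data frame
as_df <- parse_json(demo_json, simplifyVector = TRUE)

class(as_list)
class(as_df)
```

With the default setting each book stays a separate list element, so building a table would take an extra step; with simplifyVector = TRUE the columns are ready to pass to kable.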

XML

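Use the RCurl function getURL to download the XML file from the web, then use the xmlParse function to parse it. Each field is then extracted with xpathSApply, and the resulting vectors are combined into a data frame.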

books.xml <- getURL("https://raw.githubusercontent.com/aliceafriedman/DATA607_HW_WK7/master/books.xml")
books.xml <- xmlParse(books.xml)
Title <- xpathSApply(books.xml, "//book/title", fun=xmlValue)
Year <- xpathSApply(books.xml, "//book/Year", fun=xmlValue)
Auth1.Last <- xpathSApply(books.xml, "//book/Auth1.Last", fun=xmlValue)
Auth2.Last <- xpathSApply(books.xml, "//book/Auth2.Last", fun=xmlValue)
Auth1.First <- xpathSApply(books.xml, "//book/Auth1.First", fun=xmlValue)
Auth2.First <- xpathSApply(books.xml, "//book/Auth2.First", fun=xmlValue)
Short.Description <- xpathSApply(books.xml, "//book/Short.Description", fun=xmlValue)
books.xml.df <- data.frame(Title, Year, Auth1.Last, Auth1.First, Auth2.Last, Auth2.First, Short.Description)
glimpse(books.xml.df)
## Observations: 4
## Variables: 7
## $ Title             <fct> How to Be a Straight-A Student, Deep Work, I...
## $ Year              <fct> 2006, 2016, 2003, 2005
## $ Auth1.Last        <fct> Newport, Newport, Paul, Page
## $ Auth1.First       <fct> Cal, Cal, Marilyn, Diamond Dallas
## $ Auth2.Last        <fct> , , , Aaron
## $ Auth2.First       <fct> , , , Craig
## $ Short.Description <fct> A guide for getting better grades in college...
books.xml.df %>% kable() %>% kable_styling(bootstrap_options = "responsive")
Title | Year | Auth1.Last | Auth1.First | Auth2.Last | Auth2.First | Short.Description
--- | --- | --- | --- | --- | --- | ---
How to Be a Straight-A Student | 2006 | Newport | Cal | | | A guide for getting better grades in college, most of which still applies for grad school, and some of which is useful in professional settings
Deep Work | 2016 | Newport | Cal | | | Treatise on the importance of cariving out time for deep concerntaiton, with many helpful, practical tips for improving your productivity at work
It’s Hard to Make a Difference When You Can’t Find Your Keys | 2003 | Paul | Marilyn | | | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills
Yoga for Regular Guys | 2005 | Page | Diamond Dallas | Aaron | Craig | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills
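
The repeated xpathSApply calls above could also be written as a loop over field names. This is a minimal sketch, not the method used in this report: the inline XML, the fields vector, and the demo_* names are hypothetical, and it passes stringsAsFactors = FALSE explicitly so the columns stay character regardless of the R version.

```r
library(XML)

# Hypothetical XML standing in for the real books.xml file
demo_xml <- "<books>
  <book><Title>Deep Work</Title><Year>2016</Year></book>
  <book><Title>Yoga for Regular Guys</Title><Year>2005</Year></book>
</books>"

# asText = TRUE tells xmlParse the argument is XML content, not a file name
demo_doc <- xmlParse(demo_xml, asText = TRUE)

# Extract one character vector per field, then name the list elements
fields <- c("Title", "Year")
demo_cols <- lapply(fields, function(f) {
  xpathSApply(demo_doc, paste0("//book/", f), xmlValue)
})
names(demo_cols) <- fields

# Combine into a data frame without converting strings to factors
demo_xml_df <- as.data.frame(demo_cols, stringsAsFactors = FALSE)
str(demo_xml_df)
```

Note that xmlValue always returns text, so Year is still a character column here and needs an explicit conversion if numbers are wanted.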

Conclusion

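Although the kable tables look nearly identical, the XML-to-R process generates a data frame in which every column, including Year, is stored as a factor. This is because xmlValue returns text and, in R versions before 4.0, data.frame converts strings to factors by default. The HTML and JSON imports generated data frames that correctly identified character and integer columns.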
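
If the factor columns need to match the HTML and JSON results, they can be converted after the fact. The sketch below assumes a small hypothetical data frame (books_fct) built with stringsAsFactors = TRUE to mimic the XML result; it is one possible fix, not part of the original analysis.

```r
# Hypothetical data frame mimicking the factor columns from the XML import
books_fct <- data.frame(
  Title = c("Deep Work", "Yoga for Regular Guys"),
  Year = c("2016", "2005"),
  stringsAsFactors = TRUE
)

books_fixed <- books_fct

# Factor to character is a direct conversion
books_fixed$Title <- as.character(books_fixed$Title)

# Factor to integer must go through character first; as.integer() on a
# factor alone would return the internal level codes, not the years
books_fixed$Year <- as.integer(as.character(books_fixed$Year))

str(books_fixed)
```

Going through as.character first matters: the levels of Year sort as "2005", "2016", so calling as.integer directly on the factor would give 2 and 1 instead of 2016 and 2005.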