Setup
library(knitr)
library(kableExtra)
library(prettydoc)
library(rvest)
library(dplyr)
library(jsonlite)
library(RCurl)
library(XML)
remove(list=ls())
HTML
First, read and parse the raw HTML file into an R object using the read_html function.
books.html <- read_html("https://raw.githubusercontent.com/aliceafriedman/DATA607_HW_WK7/master/books.html")
Then extract the table nodes from the parsed object, books.html, using the html_nodes function from the rvest library, and convert the first one to a data frame with html_table.
books.html.table <- html_nodes(books.html, "table")
books.html.table <- html_table(books.html.table[1], fill = TRUE) %>% as.data.frame() %>% glimpse()
## Observations: 4
## Variables: 7
## $ Title <chr> "How to be a Straight-A Student", "Deep Work...
## $ Auth1.Last <chr> "Newport", "Newport", "Paul", "Dallas"
## $ Auth1.First <chr> "Cal", "Cal", "Marilyn", "Diamond Dallas"
## $ Auth2.Last <chr> "", "", "", "Aaron"
## $ Auth2.First <chr> "", "", "", "Craig"
## $ Year <int> 2006, 2016, 2003, 2005
## $ Short.Description <chr> "A guide for getting better grades in colleg...
class(books.html.table)
## [1] "data.frame"
books.html.table %>%
  kable() %>% kable_styling(bootstrap_options = "responsive")

| Title | Auth1.Last | Auth1.First | Auth2.Last | Auth2.First | Year | Short.Description |
|---|---|---|---|---|---|---|
| How to be a Straight-A Student | Newport | Cal | | | 2006 | A guide for getting better grades in college, most of which still applies for grad school, and some of which is useful in professional settings |
| Deep Work | Newport | Cal | | | 2016 | Treatise on the importance of cariving out time for deep concerntaiton, with many helpful, practical tips for improving your productivity at work |
| It’s Hard to Make a Difference When You Can’t Find Your Keys | Paul | Marilyn | | | 2003 | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills. |
| Yoga for Regular Guys | Dallas | Diamond Dallas | Aaron | Craig | 2005 | Easy to follow workouts for folks who aren’t too flexible. |
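
The same read, select, and convert steps can be tried without a network connection. What follows is only a minimal sketch, assuming the rvest package is installed; the inline HTML string and the names toy.html, toy.doc and toy.table are invented for illustration and are not taken from books.html.

```r
library(rvest)

# Hypothetical two-row table standing in for books.html
toy.html <- '<html><body><table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Deep Work</td><td>2016</td></tr>
  <tr><td>Yoga for Regular Guys</td><td>2005</td></tr>
</table></body></html>'

# read_html() accepts literal markup as well as a URL or file path
toy.doc <- read_html(toy.html)

# html_nodes() returns every node matching the CSS selector "table";
# [[1]] picks the first node and html_table() converts it to a table
toy.table <- as.data.frame(html_table(html_nodes(toy.doc, "table")[[1]]))

str(toy.table)
```

Selecting a single node with [[1]] makes html_table return one table rather than a list of tables, so the final as.data.frame call only normalizes the class.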
JSON
First, read the JSON objects into R using the jsonlite package. Setting the simplifyVector option to TRUE automatically creates a data frame.
Unlike reading from HTML, this will also create row numbers.
books.json <- read_json("https://raw.githubusercontent.com/aliceafriedman/DATA607_HW_WK7/master/books.json", simplifyVector = TRUE)
class(books.json)
## [1] "data.frame"
books.json %>% glimpse()
## Observations: 4
## Variables: 7
## $ Title <chr> "How to Be a Straight-A Student", "Deep Work...
## $ Auth1.Last <chr> "Newport", "Newport", "Paul", "Page"
## $ Auth1.First <chr> "Cal", "Cal", "Marilyn", "Diamond Dallas"
## $ Auth2.Last <chr> "", "", "", "Aaron"
## $ Auth2.First <chr> "", "", "", "Craig"
## $ Year <int> 2006, 2016, 2003, 2005
## $ Short.Description <chr> "A guide for getting better grades in colleg...
books.json %>% kable() %>% kable_styling(bootstrap_options = "responsive")

| Title | Auth1.Last | Auth1.First | Auth2.Last | Auth2.First | Year | Short.Description |
|---|---|---|---|---|---|---|
| How to Be a Straight-A Student | Newport | Cal | | | 2006 | A guide for getting better grades in college, most of which still applies for grad school, and some of which is useful in professional settings |
| Deep Work | Newport | Cal | | | 2016 | Treatise on the importance of cariving out time for deep concerntaiton, with many helpful, practical tips for improving your productivity at work |
| It’s Hard to Make a Difference When You Can’t Find Your Keys | Paul | Marilyn | | | 2003 | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills |
| Yoga for Regular Guys | Page | Diamond Dallas | Aaron | Craig | 2005 | Easy to follow workouts for folks who aren’t too flexible |
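
To see what simplifyVector changes, the two settings can be compared on a small inline example. This is a sketch under stated assumptions: the JSON string and the names toy.json, toy.df and toy.list are hypothetical, and fromJSON is used instead of read_json only because it accepts a JSON string directly.

```r
library(jsonlite)

# Hypothetical JSON array of records standing in for books.json
toy.json <- '[
  {"Title": "Deep Work", "Year": 2016},
  {"Title": "Yoga for Regular Guys", "Year": 2005}
]'

# simplifyVector = TRUE collapses the array of records into a data frame
toy.df <- fromJSON(toy.json, simplifyVector = TRUE)
class(toy.df)

# simplifyVector = FALSE keeps the nesting: a list with one element per record
toy.list <- fromJSON(toy.json, simplifyVector = FALSE)
class(toy.list)
```

With the nested list form, each book would have to be flattened by hand before it could be passed to kable, which is why the simplified form is convenient here.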
XML
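Use the RCurl function getURL to download the XML file from the web, then use the xmlParse function to parse it. Each field is then extracted with xpathSApply, and the resulting vectors are combined into a data frame.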
books.xml <- getURL("https://raw.githubusercontent.com/aliceafriedman/DATA607_HW_WK7/master/books.xml")
books.xml <- xmlParse(books.xml)
Title <- xpathSApply(books.xml, "//book/title", fun=xmlValue)
Year <- xpathSApply(books.xml, "//book/Year", fun=xmlValue)
Auth1.Last <- xpathSApply(books.xml, "//book/Auth1.Last", fun=xmlValue)
Auth2.Last <- xpathSApply(books.xml, "//book/Auth2.Last", fun=xmlValue)
Auth1.First <- xpathSApply(books.xml, "//book/Auth1.First", fun=xmlValue)
Auth2.First <- xpathSApply(books.xml, "//book/Auth2.First", fun=xmlValue)
Short.Description <- xpathSApply(books.xml, "//book/Short.Description", fun=xmlValue)
books.xml.df <- data.frame(Title, Year, Auth1.Last, Auth1.First, Auth2.Last, Auth2.First, Short.Description)
glimpse(books.xml.df)
## Observations: 4
## Variables: 7
## $ Title <fct> How to Be a Straight-A Student, Deep Work, I...
## $ Year <fct> 2006, 2016, 2003, 2005
## $ Auth1.Last <fct> Newport, Newport, Paul, Page
## $ Auth1.First <fct> Cal, Cal, Marilyn, Diamond Dallas
## $ Auth2.Last <fct> , , , Aaron
## $ Auth2.First <fct> , , , Craig
## $ Short.Description <fct> A guide for getting better grades in college...
books.xml.df %>% kable() %>% kable_styling(bootstrap_options = "responsive")

| Title | Year | Auth1.Last | Auth1.First | Auth2.Last | Auth2.First | Short.Description |
|---|---|---|---|---|---|---|
| How to Be a Straight-A Student | 2006 | Newport | Cal | | | A guide for getting better grades in college, most of which still applies for grad school, and some of which is useful in professional settings |
| Deep Work | 2016 | Newport | Cal | | | Treatise on the importance of cariving out time for deep concerntaiton, with many helpful, practical tips for improving your productivity at work |
| It’s Hard to Make a Difference When You Can’t Find Your Keys | 2003 | Paul | Marilyn | | | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills |
| Yoga for Regular Guys | 2005 | Page | Diamond Dallas | Aaron | Craig | Helpful tips on time management, organization, and making sure you get places on time written by someone who really understands what it’s like not to have those habits and skills |
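
The parse-and-extract pattern used above can be checked on a small inline document. The sketch below assumes the XML package is installed; the XML string and its element names are hypothetical stand-ins and may not match the real books.xml.

```r
library(XML)

# Hypothetical XML standing in for books.xml; element names are assumed
toy.xml <- '<books>
  <book><Title>Deep Work</Title><Year>2016</Year></book>
  <book><Title>Yoga for Regular Guys</Title><Year>2005</Year></book>
</books>'

# asText = TRUE tells xmlParse() the argument is XML content, not a file name
toy.doc <- xmlParse(toy.xml, asText = TRUE)

# xpathSApply() applies xmlValue to each node matched by the XPath expression
Title <- xpathSApply(toy.doc, "//book/Title", xmlValue)
Year  <- xpathSApply(toy.doc, "//book/Year", xmlValue)

# Every extracted value arrives as text, including the years
class(Year)
```

Because XPath matching is case-sensitive, each expression has to use exactly the element names found in the file.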
Conclusion
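Although the kables look alike, the XML-to-R process generates a data frame in which every column, including Year, is stored as a factor. This happens because xpathSApply returns character vectors, and data.frame converted strings to factors by default (stringsAsFactors = TRUE) in R versions before 4.0. The HTML and JSON imports, by contrast, produced data frames that correctly identified character and integer values.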
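
One way to deal with the factor columns is sketched below, using base R only. The data frame toy.df is a hypothetical stand-in for books.xml.df, and the two options shown are illustrative rather than the approach taken in the original analysis.

```r
# Hypothetical data frame mimicking books.xml.df, with every column a factor
toy.df <- data.frame(
  Title = c("Deep Work", "Yoga for Regular Guys"),
  Year  = c("2016", "2005"),
  stringsAsFactors = TRUE
)

# Option 1: repair the existing data frame.
# Go through as.character() first: as.integer() on a factor returns level codes.
fixed.df <- toy.df
fixed.df$Title <- as.character(fixed.df$Title)
fixed.df$Year  <- as.integer(as.character(fixed.df$Year))

# Option 2: avoid factors when the data frame is built
clean.df <- data.frame(
  Title = c("Deep Work", "Yoga for Regular Guys"),
  Year  = as.integer(c("2016", "2005")),
  stringsAsFactors = FALSE
)

str(fixed.df)
```

Either option yields a character Title column and an integer Year column, matching the types that the HTML and JSON imports produced.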