1-2. (HTML) Begin by collecting data from the HTML file.
# The HTML file is stored on GitHub for ease of access.
html.url <- getURL("https://raw.githubusercontent.com/JeremyOBrien16/DATA-607/master/favbooks.html")
# readHTMLTable() returns a list with one element per table found in the page.
data.html <- readHTMLTable(html.url, header = TRUE, as.data.frame = TRUE)
# We give it a visual check.
data.html %>% kable()
# Tried kable_styling() to improve the look of the resulting tables, but it had no visible effect.
1-2. (XML) Next, collect data from the XML file.
# As before, the XML file is stored on GitHub for ease of access.
xml.url <- getURL("https://raw.githubusercontent.com/JeremyOBrien16/DATA-607/master/favbooks.xml")
file.xml <- xmlParse(file = xml.url)
# We call xmlToDataFrame() from the XML package to get the data into a data frame.
df.xml <- xmlToDataFrame(file.xml)
# We give it a visual check.
df.xml %>% kable()
| author | title | genre | publisher_name | price | publication_year | review | pagenums |
|---|---|---|---|---|---|---|---|
| Charles Stross | Palimpsest | Science Fiction | Subterranean | 38.88 | 2011 | Welcome to the Stasis, the clandestine, near-omnipotent organization | 136 |
| Naomi Duguid, Jeffrey Alford | Beyond the Great Wall | Cooking | Artisan | 33.88 | 2008 | Bring home the enticing flavors of the outlying areas of China | 376 |
| Norman Davies | Vanished Kingdoms: The Rise and Fall of States and Nations | History | Viking | 40.00 | 2012 | A dozen-plus examples from European history constitute this ruminative disquisition of the impermanence of polities | 848 |
# Tried kable_styling() to improve the look of the resulting tables, but it had no visible effect.
1-2. (JSON) Lastly, collect data from the JSON file.
# Likewise, the JSON file is stored on GitHub for ease of access.
json.url <- getURL("https://raw.githubusercontent.com/JeremyOBrien16/DATA-607/master/favbooks.json")
file.json <- json.url
# We call jsonlite's fromJSON() to parse the JSON text into R, and then coerce the result into a data frame.
data.json <- fromJSON(file.json)
df.json <- as.data.frame(data.json)
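# We clean up the column headers by keeping only the letters that follow the dot.
# Note: [[:alpha:]]+ stops at underscores, so a name such as publisher_name is cut to publisher.
colnames(df.json) <- str_extract(colnames(df.json), "(?<=\\.)[[:alpha:]]+")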
# We give it a visual check.
df.json %>% kable()
| author | title | genre | publisher | price | publication | review | pagenums |
|---|---|---|---|---|---|---|---|
| Stross, Charles | Palimpsest | Science Fiction | Subterranean | 38.88 | 2011 | Welcome to the Stasis, the clandestine, near-omnipotent organization | 136 |
| Duguid, Naomi, Alford, Jeffrey | Beyond the Great Wall | Cooking | Artisan | 33.88 | 2008 | Bring home the enticing flavors of the outlying areas of China | 376 |
| Davies, Norman | Vanished Kingdoms: The Rise and Fall of States and Nations | History | Viking | 40.00 | 2012 | A dozen-plus examples from European history constitute this ruminative disquisition of the impermanence of polities | 848 |
# Tried kable_styling() to improve the look of the resulting tables, but it had no visible effect.
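The JSON table shows publisher and publication where the XML table shows publisher_name and publication_year. A likely cause is the header cleanup pattern, which matches letters only. The base R sketch below illustrates this with hypothetical prefixed names (the "favbooks." prefix is an assumption, not taken from the actual data) and shows one alternative that keeps the full name.

```r
# Hypothetical prefixed names standing in for colnames(df.json) before cleanup.
prefixed <- c("favbooks.author", "favbooks.publisher_name", "favbooks.publication_year")

# Same pattern as in the cleanup step: letters only, so matching stops at the underscore.
letters_only <- regmatches(prefixed, regexpr("(?<=\\.)[[:alpha:]]+", prefixed, perl = TRUE))

# Alternative: strip everything up to and including the first dot, keeping the rest intact.
full_names <- sub("^[^.]*\\.", "", prefixed)

print(letters_only)
print(full_names)
```

With the second approach the JSON headers would line up with the XML and HTML headers, which would make a later comparison of the three data frames simpler.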
# Examine the structure of each data frame.
str(data.html)
## List of 1
## $ NULL:'data.frame': 3 obs. of 8 variables:
## ..$ author : Factor w/ 3 levels "Charles Stross",..: 1 2 3
## ..$ title : Factor w/ 3 levels "Beyond the Great Wall",..: 2 1 3
## ..$ genre : Factor w/ 3 levels "Cooking","History",..: 3 1 2
## ..$ publisher_name : Factor w/ 3 levels "Artisan","Subterranean",..: 2 1 3
## ..$ price : Factor w/ 3 levels "33.88","38.88",..: 2 1 3
## ..$ publication_year: Factor w/ 3 levels "2008","2011",..: 2 1 3
## ..$ review : Factor w/ 3 levels "A dozen-plus exammples from Europearn history constitute this ruminative disquisition of the impermanence of polities",..: 3 2 1
## ..$ pagenums : Factor w/ 3 levels "136","376","848": 1 2 3
str(df.xml)
## 'data.frame': 3 obs. of 8 variables:
## $ author : Factor w/ 3 levels "Charles Stross",..: 1 2 3
## $ title : Factor w/ 3 levels "Beyond the Great Wall",..: 2 1 3
## $ genre : Factor w/ 3 levels "Cooking","History",..: 3 1 2
## $ publisher_name : Factor w/ 3 levels "Artisan","Subterranean",..: 2 1 3
## $ price : Factor w/ 3 levels "33.88","38.88",..: 2 1 3
## $ publication_year: Factor w/ 3 levels "2008","2011",..: 2 1 3
## $ review : Factor w/ 3 levels "A dozen-plus examples from European history constitute this ruminative disquisition of the impermanence of polities",..: 3 2 1
## $ pagenums : Factor w/ 3 levels "136","376","848": 1 2 3
str(df.json)
## 'data.frame': 3 obs. of 8 variables:
## $ author :List of 3
## ..$ : chr "Stross, Charles"
## ..$ : chr "Duguid, Naomi" "Alford, Jeffrey"
## ..$ : chr "Davies, Norman"
## $ title : chr "Palimpsest" "Beyond the Great Wall" "Vanished Kingdoms: The Rise and Fall of States and Nations"
## $ genre : chr "Science Fiction" "Cooking" "History"
## $ publisher : chr "Subterranean" "Artisan" "Viking"
## $ price : num 38.9 33.9 40
## $ publication: int 2011 2008 2012
## $ review : chr "Welcome to the Stasis, the clandestine, near-omnipotent organization" "Bring home the enticing flavors of the outlying areas of China" "A dozen-plus examples from European history constitute this ruminative disquisition of the impermanence of polities"
## $ pagenums : int 136 376 848
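The str() output shows that data.html is a list holding one data frame rather than a data frame itself. A minimal sketch of unwrapping it follows; the list here is a small stand-in built by hand, not the real readHTMLTable() result, and the name df.html is introduced only for illustration.

```r
# Stand-in for the list returned when reading the HTML page (one table inside).
tables <- list(data.frame(
  title = c("Palimpsest", "Beyond the Great Wall"),
  pagenums = c("136", "376"),
  stringsAsFactors = TRUE
))

# Pull out the first (here, only) element to get a plain data frame.
df.html <- tables[[1]]

print(class(tables))
print(class(df.html))
str(df.html)
```

After unwrapping, the HTML data has the same shape as df.xml, with every column stored as a factor.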
As a next step, we could harmonize the three data frames by coercing each variable to a common class.
(I gave this a shot, but it got messy, so I moved on.)
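For reference, here is a rough base R sketch of what that harmonization might look like. The target classes (character for text, numeric for price, integer for year and page count) are an assumption, and the data frames are small stand-ins rather than the real df.xml and df.json.

```r
# Small stand-in for df.xml or the HTML table: every column is a factor.
df.factors <- data.frame(
  author = c("Charles Stross", "Norman Davies"),
  price = c("38.88", "40.00"),
  publication_year = c("2011", "2012"),
  pagenums = c("136", "848"),
  stringsAsFactors = TRUE
)

# Convert factors via character first; as.numeric() applied directly to a
# factor returns the internal level codes, not the printed values.
harmonized <- data.frame(
  author = as.character(df.factors$author),
  price = as.numeric(as.character(df.factors$price)),
  publication_year = as.integer(as.character(df.factors$publication_year)),
  pagenums = as.integer(as.character(df.factors$pagenums)),
  stringsAsFactors = FALSE
)

# Stand-in for the JSON author list-column: collapse each entry to one string.
author.list <- list("Stross, Charles", c("Duguid, Naomi", "Alford, Jeffrey"))
author.flat <- vapply(author.list, paste, character(1), collapse = ", ")

str(harmonized)
print(author.flat)
```

The messy part is probably the author column, since the JSON version is a list with a different name order ("Stross, Charles" versus "Charles Stross"), so matching it to the other two sources would need extra string handling beyond a class change.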